Browser Using Agent (BUA)
Build a browser-using agent that can perform tasks on your behalf on the web
Overview
The Browser-Using Agent (BUA) echoes the computer-using agent (CUA) model popularized by OpenAI but extends it to browser environments.
Traditional CUA-style models combine the vision and reasoning capabilities of LLMs to control computer interfaces and perform tasks. A Browser-Using Agent focuses exclusively on the browser as the interface the agent interacts with. Browsers are a special kind of computer interface: given access to the DOM of the page, an AI agent's performance can be greatly improved.
BUA is available through the `bua/completions` endpoint.
In addition to the traditional CUA screenshot + prompt input, BUA also leverages the DOM of the page for improved understanding of and reasoning over web pages. This is explained in the figure below.
1. **Send a request to `bua/completions`.** Include the computer tool as part of the available tools, specifying the display size and environment. You can also include a screenshot of the initial state of the environment in the first request.
2. **Receive a response from the BUA model.** The response contains a list of actions to take to make progress toward the specified goal, such as clicking at a given position, typing text, scrolling, or waiting.
3. **Execute the requested action.** Execute the corresponding action on your browser environment through code.
4. **Capture the updated state.** After executing the action, capture the updated state of the environment as a screenshot.
5. **Repeat.** Send a new request with the updated state as a `computer_call_output`, and repeat this loop until the model stops requesting actions or you decide to stop.
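The loop above can be sketched in Python. This is a minimal illustration, not the official client: the `client.complete()` call, the `computer` interface, and the payload fields (`goal`, `screenshot`, `dom`, `computer_call_output`) are assumed names standing in for the real request schema.

```python
def run_bua_loop(client, computer, goal, max_steps=20):
    """Drive the BUA loop: send state, execute returned actions, repeat.

    ``client.complete(payload)`` is assumed to POST to the bua/completions
    endpoint; ``computer`` is assumed to wrap your browser environment.
    """
    # First request: goal plus the initial state of the environment.
    payload = {
        "goal": goal,
        "screenshot": computer.screenshot(),  # base64-encoded image
        "dom": computer.dom(),                # DOM snapshot as HTML
    }
    response = None
    for _ in range(max_steps):
        response = client.complete(payload)
        actions = response.get("actions", [])
        if not actions:
            # The model stopped requesting actions: we are done.
            return response
        for action in actions:
            computer.execute(action)  # click, type, scroll, wait, ...
        # Capture the updated state and send it back on the next turn.
        payload = {
            "goal": goal,
            "computer_call_output": {
                "screenshot": computer.screenshot(),
                "dom": computer.dom(),
            },
        }
    return response
```

The loop terminates either when the model returns no further actions or when the step budget is exhausted, matching the stopping conditions described above.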
Before you can use BUA, you need a browser environment that can capture screenshots and DOM snapshots of a given web page. We advise using Playwright for this purpose. Your environment should expose two capture methods:

- `computer.screenshot()` — returns a screenshot of the current page
- `computer.dom()` — returns a DOM snapshot of the current page
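A minimal Playwright-backed environment exposing these two methods could look like the sketch below. The `BrowserComputer` class name and wiring are illustrative assumptions; only `page.screenshot()` and `page.content()` are real Playwright `Page` methods.

```python
import base64


class BrowserComputer:
    """Illustrative browser environment exposing screenshot() and dom().

    Wraps any object with the Playwright ``Page`` interface; this class
    is a sketch, not part of the BUA API.
    """

    def __init__(self, page):
        self.page = page

    def screenshot(self):
        # Playwright returns raw PNG bytes; base64-encode them so they
        # can travel in a JSON request payload.
        return base64.b64encode(self.page.screenshot()).decode("ascii")

    def dom(self):
        # Full serialized HTML of the current page.
        return self.page.content()
```

With the sync Playwright API you would construct it roughly as `BrowserComputer(browser.new_page())` after launching a browser via `sync_playwright()`.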
The first request will contain the initial state of the environment: a screenshot of the page and its DOM.
How you map a browser call to actions through code depends on your environment. If you are using Playwright as your browser automation library, we already have a library that maps browser calls to Playwright actions:
You can check out the library for an example implementation, in particular:
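As a rough sketch of what such a mapping involves, the dispatcher below translates hypothetical BUA action objects into real Playwright sync-API calls (`page.mouse.click`, `page.keyboard.type`, `page.mouse.wheel`, `page.wait_for_timeout`). The action schema (`type` plus per-action fields) is an assumption for illustration; consult the library for the actual mapping.

```python
def execute_action(page, action):
    """Dispatch a single BUA-style action onto a Playwright ``Page``.

    The action dict shape here is assumed, not the official schema.
    """
    kind = action["type"]
    if kind == "click":
        # Click at absolute viewport coordinates.
        page.mouse.click(action["x"], action["y"])
    elif kind == "type":
        # Type text into the currently focused element.
        page.keyboard.type(action["text"])
    elif kind == "scroll":
        # Positive delta_y scrolls down.
        page.mouse.wheel(action.get("delta_x", 0), action.get("delta_y", 0))
    elif kind == "wait":
        page.wait_for_timeout(action.get("ms", 1000))
    else:
        raise ValueError(f"Unsupported action type: {kind}")
```

Keeping the dispatcher as a single function makes it straightforward to add new action types as the model's action space grows.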