Alok Bishoyi

Reverse Engineering Antigravity's Browser Automation

November 19, 2025

Google launched Antigravity IDE on November 18, 2025. It's a fork of VS Code with AI agents built in. The agents can write code, edit files, run terminal commands, and control a browser. I got access to the public preview and tried it out.

What struck me was the browser integration. Cursor has something similar—agents that can interact with web pages—but Antigravity's implementation felt different. When you ask it to do something in a browser, a Chrome window opens, the agent navigates and clicks around, and everything gets recorded as a video artifact. I wanted to know how it worked.

So I treated it as a black box and worked backwards.

The Entry Point: The Agent's Tools

The investigation started with the agent itself. I asked it to look at its own system instructions, and it revealed this tool definition:

Tool: browser_subagent

Start a browser subagent to perform actions in the browser with the given task description. The subagent has access to tools for both interacting with web page content (clicking, typing, navigating, etc) and controlling the browser window itself (resizing, etc). Please make sure to define a clear condition to return on. After the subagent returns, you should read the DOM or capture a screenshot to see what it did.

Note: All browser interactions are automatically recorded and saved as WebP videos to the artifacts directory. This is the ONLY way you can record a browser session video/animation.

IMPORTANT: If the subagent returns that the open_browser_url tool failed, there is a browser issue that is out of your control. You MUST ask the user how to proceed and use the suggested_responses tool.

Parameters

This was the first clue. The tool isn't a direct command like `click()` or `type()`. It's a request to start a sub-agent. The main agent delegates the high-level goal ("Go to Google") to this sub-agent, and it handles the details.

So the question became: What is this sub-agent? And where does it live? Armed with the tool definition, I turned to the terminal to find the running processes backing this capability.

Chapter 1: The Black Box

My first step was standard reconnaissance. If there's a browser window open, there must be a process running it. I ran ps aux | grep Chrome and found the smoking gun immediately:

$ ps aux | grep Chrome
/Applications/Google Chrome.app/Contents/MacOS/Google Chrome \
  --remote-debugging-port=9222 \
  --user-data-dir=/Users/alokbishoyi/.gemini/antigravity-browser-profile \
  --disable-fre --no-default-browser-check

Standard Chrome, but with remote debugging enabled on port 9222. This is the Chrome DevTools Protocol (CDP) interface. I verified it was listening:

$ curl http://127.0.0.1:9222/json/version
{"Browser":"Chrome/131.0.6778.0","Protocol-Version":"1.3",...}

But who was talking to it? I ran lsof -i :9222:

$ lsof -i :9222
COMMAND   PID  USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
node    38213 alokbishoyi   23u  IPv6 0x...      0t0  TCP *:9222 (LISTEN)

A Node.js process. I checked what it was running:

$ ps -p 38213 -o command=
node /Users/alokbishoyi/.npm/_npx/.../node_modules/@agentdeskai/browser-tools-mcp/...

An MCP (Model Context Protocol) server package named @agentdeskai/browser-tools-mcp. But the MCP server had to be getting instructions from somewhere. I checked what spawned it:

$ ps -ef | grep 38213
  15440     1 alokbishoyi  ... /Applications/Antigravity.app/.../language_server_macos_arm
  38213 15440 alokbishoyi  ... node .../browser-tools-mcp

PID 15440 was the parent. I scanned for all listening ports:

$ lsof -i -P | grep LISTEN | grep 15440
language_server_macos_arm  15440  alokbishoyi   12u  IPv6  ... TCP *:53410 (LISTEN)
language_server_macos_arm  15440  alokbishoyi   13u  IPv6  ... TCP *:53412 (LISTEN)
language_server_macos_arm  15440  alokbishoyi   14u  IPv6  ... TCP *:53413 (LISTEN)
language_server_macos_arm  15440  alokbishoyi   15u  IPv6  ... TCP *:53422 (LISTEN)

Four ports. I checked the process details:

$ ps -p 15440 -o command=
/Applications/Antigravity.app/.../language_server_macos_arm \
  --extension_server_port 53410 \
  --enable_lsp \
  --api_server_url http://jetski-server.corp.goog \
  --csrf_token [REDACTED]

Port 53410 was the extension server—the API endpoint for tool execution. Ports 53412-53413 were standard LSP channels for code intelligence. Port 53422 was an additional service channel. This was Antigravity's custom coordination server handling both code features and agent orchestration. The command-line flags revealed it connects to Google's internal infrastructure (jetski-server.corp.goog) and uses CSRF tokens for authentication.

Chapter 2: Cracking the Binary

The Language Server was a compiled binary, which usually means a dead end. But I decided to try a classic reverse-engineering trick: the strings command.

$ strings /Applications/Antigravity.app/.../language_server_macos_arm | grep -i browser | head -20

The output was a goldmine. It contained specific Go handlers for every browser action:

third_party/jetski/cortex/handlers/browser_subagent_handler.go
third_party/jetski/cortex/handlers/browser_click_element_handler.go
third_party/jetski/cortex/handlers/browser_press_key_handler.go
third_party/jetski/cortex/handlers/browser_resize_window_handler.go
third_party/jetski/cortex/handlers/browser_scroll_down_handler.go
third_party/jetski/cortex/handlers/browser_scroll_handler.go
third_party/jetski/cortex/handlers/browser_scroll_up_handler.go
third_party/jetski/cortex/handlers/browser_select_option_handler.go
third_party/jetski/cortex/handlers/capture_browser_screenshot_handler.go
third_party/jetski/cortex/handlers/read_browser_page_handler.go

"Jetski". That was the internal codename. It was a collection of granular handlers. The browser_subagent_handler.go seemed to be the brain, while others handled specific motor functions like clicking elements, scrolling, capturing screenshots, and reading page content.

I also found references to strongly typed structures:

$ strings language_server_macos_arm | grep -i "Browser.*Tool"
BrowserInputToolConverter
BrowserScrollDownToolArgs
BrowserClickPixelToolArgs
BrowserNavigateToolArgs
BrowserCaptureScreenshotToolArgs
BrowserClickElementToolArgs
BrowserScrollUpToolArgs
BrowserScrollToolArgs
BrowserSelectOptionToolArgs
BrowserResizeWindowToolArgs
BrowserPressKeyToolArgs

This proved that the server maintains a strongly typed internal representation of the browser tools before serializing them for the LLM. Each tool had its own argument structure, suggesting a well-defined API boundary between the language server and the browser automation layer.
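
The Go struct definitions themselves aren't recoverable from a strings dump, but combining the names above with the parameter lists documented later, the argument types plausibly look like this (sketched in TypeScript; the type names come from the binary, the fields are inferred):

// Names from the binary; fields inferred from the tool parameter lists below.
interface BrowserClickElementToolArgs {
  element_index: number; // index assigned by the DOM tree from read_browser_page
  page_id?: string;      // optional: defaults to the active page
}

interface BrowserScrollToolArgs {
  element_index?: number;                       // omit to scroll the page itself
  direction?: "up" | "down" | "left" | "right";
  dx?: number;                                  // horizontal scroll distance in pixels
  dy?: number;                                  // vertical scroll distance in pixels
  page_id?: string;
}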

Deeper analysis revealed the presence of ToolConverter logic within google3/third_party/jetski/cortex/tools/. This implies a translation layer: when the "Jetski" sub-agent decides to click or type, it calls an internal function that passes through these ToolConverters, which validate the arguments (checking PrerequisiteArgumentNames) and likely handle partial parsing for streaming responses.

The ToolConverter Architecture

Extracting function names from the binary revealed the complete ToolConverter system. I searched for method names containing "ToolConverter" and found the full function signatures:

$ strings language_server_macos_arm | grep "cortex/tools/tools" | grep "ToolConverter"
google3/third_party/jetski/cortex/tools/tools.(*OpenBrowserUrlToolConverter).GetToolDefinition
google3/third_party/jetski/cortex/tools/tools.(*OpenBrowserUrlToolConverter).ToolCallToCortexStep
google3/third_party/jetski/cortex/tools/tools.(*OpenBrowserUrlToolConverter).GetPayloadCase
google3/third_party/jetski/cortex/tools/tools.(*CaptureBrowserScreenshotToolConverter).GetToolDefinition
google3/third_party/jetski/cortex/tools/tools.(*CaptureBrowserScreenshotToolConverter).ToolCallToCortexStep
...

Each browser tool has its own dedicated converter class implementing three core methods:

  1. GetToolDefinition() returns the JSON schema that describes the tool to the LLM (parameters, types, descriptions).
  2. ToolCallToCortexStep() converts the LLM's tool call JSON into an internal Cortex step representation.
  3. GetPayloadCase() determines which protobuf message type to use for serialization.
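
Expressed in TypeScript, the contract each converter satisfies would look roughly like this; the three method names come straight from the symbol table, while the signatures and helper types are my assumptions:

// Sketch of the per-tool converter contract, reconstructed from symbol names.
interface ToolConverter {
  // JSON schema advertised to the LLM: name, description, parameters
  GetToolDefinition(): ToolDefinition;
  // Validate and convert the LLM's tool call into an internal Cortex step
  ToolCallToCortexStep(call: ToolCall): CortexStep;
  // Select which protobuf payload variant this tool serializes to
  GetPayloadCase(): string;
}

// Placeholder types: the real ones are Go protobufs inside the binary.
type ToolDefinition = { name: string; description: string; parameters: object };
type ToolCall = { name: string; args: Record<string, unknown> };
type CortexStep = Record<string, unknown>;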

To find all the ToolConverter class names, I searched for pointer type signatures:

$ strings language_server_macos_arm | grep -E "\*tools\..*ToolConverter"
*tools.BrowserInputToolConverter
*tools.BrowserGetDomToolConverter
*tools.BrowserScrollToolConverter
...

I found 19 distinct browser ToolConverters in the binary:

*tools.BrowserInputToolConverter
*tools.BrowserGetDomToolConverter
*tools.BrowserScrollToolConverter
*tools.OpenBrowserUrlToolConverter
*tools.ReadBrowserPageToolConverter
*tools.BrowserScrollUpToolConverter
*tools.BrowserPressKeyToolConverter
*tools.BrowserSubagentToolConverter
*tools.ListBrowserPagesToolConverter
*tools.BrowserMoveMouseToolConverter
*tools.ClickBrowserPixelToolConverter
*tools.BrowserScrollDownToolConverter
*tools.BrowserSelectOptionToolConverter
*tools.BrowserClickElementToolConverter
*tools.BrowserResizeWindowToolConverter
*tools.BrowserDragPixelToPixelToolConverter
*tools.CaptureBrowserScreenshotToolConverter
*tools.ExecuteBrowserJavaScriptToolConverter
*tools.CaptureBrowserConsoleLogsToolConverter

Each converter also has a corresponding StringConverter for converting internal steps back into text format that the LLM can read in conversation history. I found these by searching for "StringConverter" patterns:

$ strings language_server_macos_arm | grep -E "\*chatconverters\..*StringConverter" | grep -i browser
*chatconverters.BrowserInputStringConverter
*chatconverters.BrowserGetDomStringConverter
*chatconverters.BrowserScrollStringConverter
*chatconverters.OpenBrowserUrlStringConverter

These StringConverters handle the reverse transformation, converting internal Cortex steps into natural language text that gets included in the LLM's conversation context.

This dual-layer architecture (ToolConverters for the LLM-to-internal translation, StringConverters for the trip back into conversation history) enforces type safety at every boundary: the system validates tool calls before execution, handles partial parsing for streaming responses, and converts completed steps back into natural language the LLM can read in subsequent turns.
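
If that's right, a StringConverter reduces to rendering a completed step as a short history line. The starred strings below appear verbatim in the binary (they resurface in Chapter 3); the surrounding function is a guess:

// Hypothetical shape of a chatconverters StringConverter.
// The starred return values are verbatim strings from the binary.
function stepToHistoryString(step: { tool: string }): string {
  switch (step.tool) {
    case "click_browser_pixel":
      return "*Clicked on pixel in Jetski Browser*";
    case "capture_browser_screenshot":
      return "*Took screenshot in Jetski Browser*";
    case "read_browser_page":
      return "*Read browser page in Jetski Browser*";
    default:
      return `*Used ${step.tool} in Jetski Browser*`;
  }
}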

The Complete Browser Tool Arsenal

By extracting strings from the binary and cross-referencing with the MCP server code, I was able to reconstruct the complete set of browser tools available to the sub-agent:

Browser Navigation Tools:

  1. browser_navigate (or open_browser_url)
     - Description: "Open a URL in Jetski Browser to view the page contents of a URL in a rendered format. You can also use this tool to navigate to different URLs or reload the current page."
     - Parameters: url (STRING) - The URL to navigate to

  2. read_browser_page
     - Description: "Read browser page in Jetski Browser" / "Get the DOM tree of an open page in the Jetski Browser. Returns only interactive elements and text within the current viewport, each with an index for interaction. If an element is not included, it may be outside the viewport or getting filtered for other reasons - refer to the screenshot to confirm. Then try read_browser_page and browser_scroll tools."
     - Parameters: page_id (STRING, optional) - The page ID to read

Browser Interaction Tools:

  3. browser_click_element
     - Description: Click on an element in the browser by its index
     - Parameters:
       - element_index (INTEGER) - Index of the element from the DOM tree
       - page_id (STRING, optional) - The page ID

  4. browser_select_option
     - Description: Select an option in a dropdown/select element
     - Parameters:
       - element_index (INTEGER) - Index of the select element
       - option_value (STRING) - Value to select
       - page_id (STRING, optional)

  5. browser_press_key
     - Description: Press a keyboard key
     - Parameters:
       - key (STRING) - Key to press (e.g., "Enter", "Escape", "ArrowLeft")
       - page_id (STRING, optional)

Browser Scrolling Tools:

  6. browser_scroll
     - Description: "A tool used to scroll on an element or the page in the browser. For vertical scroll, dy is automatically set to the height of the element/page. For horizontal scroll, dx the width of the element/page. Will output the number of pixels scrolled, indicating 0 pixels if no scrolling occurred."
     - Parameters:
       - element_index (INTEGER, optional) - Index of element to scroll, or omit for page scroll
       - direction (STRING, optional) - "up", "down", "left", "right"
       - dx (INTEGER, optional) - Horizontal scroll distance
       - dy (INTEGER, optional) - Vertical scroll distance
       - page_id (STRING, optional)

  7. browser_scroll_up
     - Description: Scroll up on the page or element
     - Parameters:
       - element_index (INTEGER, optional)
       - page_id (STRING, optional)

  8. browser_scroll_down
     - Description: Scroll down on the page or element
     - Parameters:
       - element_index (INTEGER, optional)
       - page_id (STRING, optional)

Browser Window Management:

  9. browser_resize_window
     - Description: Resize the browser window
     - Parameters:
       - width (INTEGER) - New window width
       - height (INTEGER) - New window height

Browser Capture Tools:

  10. capture_browser_screenshot
      - Description: Capture a screenshot of the current browser page
      - Parameters:
        - page_id (STRING, optional) - The page ID to capture

  11. execute_browser_javascript
      - Description: Execute JavaScript code in the browser context
      - Parameters:
        - code (STRING) - JavaScript code to execute
        - page_id (STRING, optional)

Browser Page Management:

  12. list_browser_pages
      - Description: "List all open pages in Jetski Browser and their metadata (page_id, url, title, viewport size, etc.)"
      - Parameters: None
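
To make this concrete, here is what GetToolDefinition() for browser_click_element might emit to the LLM. The name, description, and parameters come from the list above; the exact schema framing is an assumption:

{
  "name": "browser_click_element",
  "description": "Click on an element in the browser by its index",
  "parameters": {
    "type": "OBJECT",
    "properties": {
      "element_index": { "type": "INTEGER", "description": "Index of the element from the DOM tree" },
      "page_id": { "type": "STRING", "description": "The page ID (optional)" }
    },
    "required": ["element_index"]
  }
}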

I also found references to the internal infrastructure:

$ strings language_server_macos_arm | grep -i jetski
jetski-server.corp.goog
jetski/cortex/handlers
jetski/cortex/tools
google3/third_party/jetski/prompt/template_provider/templates/system_prompts/

The binary confirmed the connection to Google's internal "jetski-server" infrastructure that I'd seen in the command-line flags. The presence of template_provider paths suggests that prompts are loaded from template files, explaining the fragmented nature of the prompt strings.

Chapter 3: The Soul (The Reconstructed Prompt)

If "Jetski" was the body, what was the soul? I wanted to find the system prompt—the text that tells the AI who it is.

I started by searching for common prompt patterns in the binary:

$ strings /Applications/Antigravity.app/.../language_server_macos_arm | grep -i "you are" | head -3
You are an expert AI coding assistant and are pair programming with a USER to solve a coding task. When asked, you focus on outlining the USER's main goals and anticipating likely next steps they will take.

Found the main persona. But I needed the specific instructions for the browser agent. I searched for "Jetski Browser":

$ strings /Applications/Antigravity.app/.../language_server_macos_arm | grep -C 2 "Jetski Browser"
*Listed Jetski Browser pages*
*Took screenshot in Jetski Browser*
*Clicked on pixel in Jetski Browser*
*Read browser page in Jetski Browser*
*Captured DOM tree in Jetski Browser*

These looked like internal log messages or "thoughts" the agent emits. Then I found the tool definitions themselves:

$ strings /Applications/Antigravity.app/.../language_server_macos_arm | grep "Open a URL in Jetski Browser"
Open a URL in Jetski Browser to view the page contents of a URL in a rendered format. You can also use this tool to navigate to different URLs or reload the current page.

The prompt wasn't a single contiguous block of text I could extract. Instead, it was fragmented—compiled as individual string literals scattered throughout the binary. The Language Server likely assembles these pieces dynamically at runtime to construct the full system prompt. This explains why a simple strings dump didn't reveal a neat "You are Jetski..." paragraph.

By extracting all relevant strings and cross-referencing with the template provider paths found in the binary (google3/third_party/jetski/prompt/template_provider/templates/system_prompts/), I was able to reconstruct a more complete picture of the system prompt:

Core Identity: "You are an expert AI coding assistant and are pair programming with a USER to solve a coding task. When asked, you focus on outlining the USER's main goals and anticipating likely next steps they will take." Browser Agent Context: You are operating within the "Jetski Browser" context. This is a specialized browser automation environment where you have access to browser-specific tools for interacting with web pages. Browser Capabilities: - "Open a URL in Jetski Browser to view the page contents of a URL in a rendered format. You can also use this tool to navigate to different URLs or reload the current page." - "Get the DOM tree of an open page in the Jetski Browser. Returns only interactive elements and text within the current viewport, each with an index for interaction. If an element is not included, it may be outside the viewport or getting filtered for other reasons - refer to the screenshot to confirm. Then try read_browser_page and browser_scroll tools." - "List all open pages in Jetski Browser and their metadata (page_id, url, title, viewport size, etc.)" Tool Usage Guidelines: - "Act as if the tool calls will be executed immediately after your message, and your next response will have access to their results." - "Formulate your tool calls using the xml and json format specified for each tool." - "The tool arguments should be in a valid json inside of the xml tags." - "The tool name should be the xml tag surrounding the tool call." - "You are REQUIRED to call a tool in your response." Error Handling & Recovery: - "You may have seen the following lint errors as feedback for a previous edit, but they still exist at this point. Please respond accordingly, erring toward explicitness." - "There was a problem parsing the tool call. Error Message: %v. Guidance: You are trying to correct your previous tool call error, you must focus on fixing the failed tool call with sequential tool calls and try again. Do not do parallel tool calls and if you are fixing multiple tool calls, do them one at a time. Do not apologize. Retries remaining: %d." Browser Interaction Patterns: - When elements are not visible, use browser_scroll tools to bring them into viewport - Always capture a screenshot after significant actions to verify state - Use read_browser_page to get the current DOM structure before interacting - Elements are indexed for interaction - use the index from the DOM tree response - If an element is not found, check if it's outside the viewport and scroll first Internal Thought Patterns (Log Messages): The agent emits internal thoughts that appear in logs: - "*Listed Jetski Browser pages*" - "*Took screenshot in Jetski Browser*" - "*Clicked on pixel in Jetski Browser*" - "*Read browser page in Jetski Browser*" - "*Captured DOM tree in Jetski Browser*" Task Completion: - Define clear conditions to return on - After the subagent returns, read the DOM or capture a screenshot to see what it did - All browser interactions are automatically recorded and saved as WebP videos to the artifacts directory

This confirmed that the browser_subagent spins up a dedicated sub-agent with a specific persona ("Jetski Browser") constructed from these embedded fragments. The prompt is assembled from template files at runtime, which explains why it appears fragmented in the binary—each component is stored separately and combined dynamically.

Template-Based Prompt System

The binary references suggest the prompt system uses templates stored in:

google3/third_party/jetski/prompt/template_provider/templates/system_prompts/
  - notify_user_tool.tmpl
  - conversation_logs.tmpl
  - ephemeral_message.tmpl
  - mode_descriptions.tmpl
  - persistent_context.tmpl
  - task_boundary_tool.tmpl
  - communication_style.tmpl
  - file_diffs_artifact.tmpl
  - knowledge_discovery.tmpl
  - walkthrough_artifact.tmpl

This template-based approach allows Google to update prompts without recompiling the binary, and enables dynamic prompt construction based on context, mode, and available tools.
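
I couldn't pull the template contents out of the binary, but the assembly step itself is mechanically simple. A minimal sketch of what a template provider like this might do, assuming simple {{placeholder}} substitution; only the directory and file names are real:

import { readFileSync } from "node:fs";
import { join } from "node:path";

// Directory and file names appear in the binary; everything else is assumed.
const TEMPLATE_DIR = "templates/system_prompts";

function buildSystemPrompt(fragments: string[], ctx: Record<string, string>): string {
  return fragments
    .map((name) => readFileSync(join(TEMPLATE_DIR, `${name}.tmpl`), "utf8"))
    // Substitute placeholders with runtime context (mode, available tools, etc.)
    .map((tmpl) => tmpl.replace(/\{\{(\w+)\}\}/g, (_, key) => ctx[key] ?? ""))
    .join("\n\n");
}

// e.g. buildSystemPrompt(["persistent_context", "mode_descriptions"], { mode: "browser" });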

Chapter 4: The Bridge

I had the Brain (Language Server) and the Eyes (Chrome). But how did they talk? The MCP server was the middleman, but it wasn't talking to Chrome directly.

I scrutinized the browser-tools-mcp code. Looking at the source in ~/.npm/_npx/.../browser-tools-mcp/, I saw it making HTTP requests to a discovery endpoint:

$ cat ~/.npm/_npx/.../browser-tools-mcp/src/index.ts | grep -A 5 "discover"
const discoverPort = async () => {
  for (let port = 3025; port <= 3035; port++) {
    const response = await fetch(`http://localhost:${port}/.identity`);
    if (response.ok) return port;
  }
}

It was trying ports 3025-3035, looking for a /.identity endpoint. But nothing showed up on those ports in my initial scan. Then I realized: the server might only be running when a browser session is active.

I checked the Chrome extensions directory:

$ ls -la ~/.gemini/antigravity-browser-profile/Default/Extensions/
eeijfnjmjelapkebgockoeaadonbchdd/

Found an extension with ID eeijfnjmjelapkebgockoeaadonbchdd. I looked at its manifest:

$ cat ~/.gemini/antigravity-browser-profile/.../manifest.json
{
  "name": "Antigravity Browser Connector",
  "background": { "service_worker": "service_worker_binary.js" },
  "permissions": ["debugger", "tabs"]
}

Digging into its service_worker_binary.js (minified, but readable enough), I found the missing link:

// De-minified for clarity
app.post('/navigate', async (req, res) => {
  const { url } = req.body;
  await chrome.debugger.sendCommand({ tabId: tabId }, 'Page.navigate', { url });
  res.json({ success: true });
});

app.get('/.identity', (req, res) => {
  res.json({ identity: 'mcp-browser-connector-24x7' });
});

The Extension runs a local HTTP server inside the browser. It receives high-level commands (like "navigate") via HTTP and translates them into low-level CDP WebSocket messages. The /.identity endpoint is how the MCP server discovers it.
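
This also means the bridge can be driven by hand. A minimal sketch, assuming the extension is listening on 3025 (the endpoints and the identity string come from the de-minified source; the payload shape is from the /navigate handler above):

// Node 18+, built-in fetch; 3025 is the bottom of the MCP server's scan range
const base = "http://localhost:3025";

// Discovery handshake: expects { identity: "mcp-browser-connector-24x7" }
console.log(await fetch(`${base}/.identity`).then((r) => r.json()));

// High-level command; the extension relays it as CDP Page.navigate
await fetch(`${base}/navigate`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://www.google.com" }),
});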

Why use an extension instead of talking to CDP directly? CDP is a low-level protocol: it can tell you when the network goes idle, but not when a page is actually ready for interaction. Running code inside the browser via an extension provides several advantages: it can access the DOM directly, track complex page state, bypass CORS restrictions, and expose a simple high-level API ("navigate", "click") instead of requiring direct manipulation of the DevTools protocol.

The Synthesis: The 6-Layer Architecture

Putting it all together, here is the complete flow of an Antigravity browser action:

  1. The Trigger: You ask the agent to "Go to Google".
  2. The Coordinator: The Language Server (Port 53410) spins up the "Jetski" Sub-Agent.
  3. The Brain: The Sub-Agent plans the action using its reconstructed system prompt.
  4. The Tool: It calls navigate(), which goes to the MCP Server.
  5. The Bridge: The MCP Server sends an HTTP POST to the Extension's Server (Port 3025).
  6. The Execution: The Extension translates this to CDP commands for Chrome (Port 9222).

Parting Thoughts

Previous MCP integrations followed a simpler pattern: the IDE would spawn an MCP server as a child process, communicate via STDIO, and expose tools directly to the main AI agent. Tools were typically thin wrappers around existing APIs—file operations, terminal commands, or simple HTTP requests. The agent would call these tools directly, and the MCP server would execute them synchronously.
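
In that model, a tool call is a single JSON-RPC message over stdio; this example follows the public MCP specification rather than anything extracted from Antigravity:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "browser_navigate",
    "arguments": { "url": "https://www.google.com" }
  }
}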

Antigravity departs from this pattern in three ways. First, it uses a sub-agent pattern: instead of exposing browser tools directly to the main agent, it spawns a dedicated "Jetski" sub-agent with its own system prompt and specialized tool set. This sub-agent runs as a separate AI instance, allowing it to maintain browser-specific context and decision-making logic independently from the main IDE agent.

Second, the MCP server isn't spawned directly by the IDE—it's orchestrated by the language server, which acts as a coordination layer. The language server manages the sub-agent lifecycle, routes tool calls, and handles the translation between the sub-agent's tool invocations and the actual browser automation layer.

Third, instead of using a standard browser automation library like Playwright or Puppeteer directly, Antigravity inserts a Chrome extension as an intermediary. This extension runs an HTTP server inside the browser, providing a high-level API that abstracts away the complexity of Chrome DevTools Protocol while still allowing low-level CDP access when needed.

What's interesting here is that MCP servers aren't just tool providers anymore—they're agent coordinators. When you have something complex like browser automation, you don't want your main agent thinking about DOM elements and network timing. You want it focused on code. So Antigravity delegates: the main agent coordinates, the sub-agent handles browser logic, the language server routes, and the extension executes. Each layer does one thing well.