Why Browser Agents Are Outpacing Computer-Use Models in Automation

Computer-use models are still slow and sometimes unreliable, but browser agents are already in production, including in sectors such as healthcare and insurance. In early 2025, OpenAI launched Operator, its first major attempt at an AI that could control a browser on its own. The demo showed the AI moving the mouse, clicking buttons, and completing tasks the way a human would. Many saw it as a sign that truly autonomous AI was close.

But just a few months later, in August, OpenAI quietly shut down Operator and folded its features into ChatGPT’s new Agent Mode. ChatGPT agents can now use both visual and text-based browsers. The change reflects a simple fact: computer-use models aren’t yet dependable enough for everyday use. They interpret the browser as an image and act on coordinates, like “click at (210, 260).” That works in theory but is fragile in practice, because rendering differences, loading delays, and complex layouts all shift where things appear on screen. For big companies running thousands of browser sessions, even a 1% failure rate is too high.
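To put that 1% figure in perspective, here is a rough back-of-the-envelope sketch. The workload (10,000 sessions) and the task length (20 actions) are illustrative assumptions, not numbers from any vendor, and the second function assumes that step failures are independent of each other.

```python
def expected_failed_sessions(sessions: int, failure_rate: float) -> float:
    """Expected number of failed sessions at a given per-session failure rate."""
    return sessions * failure_rate


def task_success_probability(step_success_rate: float, steps: int) -> float:
    """Chance that every step of a task succeeds, assuming independent steps."""
    return step_success_rate ** steps


if __name__ == "__main__":
    # Assumed workload: 10,000 sessions at a 1% failure rate.
    print(expected_failed_sessions(10_000, 0.01))  # 100.0
    # Assumed task length: 20 actions, each 99% reliable.
    print(round(task_success_probability(0.99, 20), 3))  # 0.818
```

Under these assumptions, a 1% failure rate means about a hundred broken sessions to investigate, and a per-action reliability of 99% leaves a 20-step task succeeding only about four times in five.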

Vision-Based Agents and Their Strengths

Vision-based agents see the browser as a visual display. They look at screenshots and interpret them using advanced models that understand images and text together. They then perform actions like “click this button” or “type this word,” based on what they see. This approach mimics how a human interacts with a screen—reading visible text, spotting buttons, and clicking where needed.

One big advantage is that these models don’t need structured data. They only need pixels, making them versatile. But there are downsides, too. Visual models are slower because they have to scroll through pages and analyze images. They also struggle with subtle changes on the page—like knowing if a button is now clickable or if the layout has shifted slightly. This can lead to errors and inconsistent performance.

DOM-Based Agents and the Power of Structure

Unlike vision models, DOM-based agents work directly with the webpage’s structure. The DOM, or Document Object Model, is a tree that represents all elements on a page—like links, buttons, and forms. Instead of looking at pixels, these agents reason over textual descriptions of the page. They see element tags, attributes, and labels, which makes figuring out what to click much easier and faster.

A technique called accessibility snapshots, popularized by Microsoft’s Playwright, turns the live page into a compact, readable text format that language models can work with directly. For example, a snapshot of the Google homepage might list navigation links like “About” and “Store” with their URLs. Each element in the snapshot carries a reference ID, so an agent can target it precisely, saying “click ref=e47” instead of guessing coordinates. This makes DOM-based control faster, more reliable, and more predictable.
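As a minimal sketch of how an agent might consume such a snapshot, the example below parses a small hand-written snippet and looks up an element by its ref. The snapshot text, its exact layout, the URLs, and the ref IDs other than e47 are illustrative assumptions, and parse_snapshot is a hypothetical helper, not part of Playwright.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hand-written snippet in the general style of an accessibility snapshot.
# The layout and values are assumptions for illustration only.
SNAPSHOT = """\
- navigation:
  - link "About" [ref=e47]:
    - /url: https://about.google/
  - link "Store" [ref=e48]:
    - /url: https://store.google.com/
- button "Google Search" [ref=e52]
"""

ELEMENT_RE = re.compile(r'^\s*-\s+(\w+)\s+"([^"]*)"\s+\[ref=(\w+)\]')
URL_RE = re.compile(r'^\s*-\s+/url:\s+(\S+)')


@dataclass
class Element:
    role: str
    name: str
    ref: str
    url: Optional[str] = None


def parse_snapshot(text: str) -> dict[str, Element]:
    """Index named elements by ref; attach a /url line to the element just above it."""
    elements: dict[str, Element] = {}
    last: Optional[Element] = None
    for line in text.splitlines():
        element_match = ELEMENT_RE.match(line)
        url_match = URL_RE.match(line)
        if element_match:
            role, name, ref = element_match.groups()
            last = Element(role=role, name=name, ref=ref)
            elements[ref] = last
        elif url_match and last is not None:
            last.url = url_match.group(1)
        else:
            # Unnamed containers such as "- navigation:" are skipped.
            last = None
    return elements


if __name__ == "__main__":
    target = parse_snapshot(SNAPSHOT)["e47"]
    print(target.role, target.name, target.url)
```

The point of the design is that the model only has to emit a short ref such as e47, and the runtime resolves it to a concrete element, so no pixel coordinates are involved.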

Combining Strengths for Better Browser Automation

In real-world applications, both vision and DOM methods have their place. Vision models are good for handling dynamic, image-heavy interfaces like dashboards or multimedia apps. DOM models excel at text-rich pages such as forms and portals. The best systems today often combine both: default to DOM control but fall back on vision if needed.
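The "DOM first, vision as fallback" pattern can be sketched in a few lines. Everything here is hypothetical: the strategy functions are stubs, the label-to-ref table stands in for a parsed snapshot, and the coordinates returned by the vision stub are hard-coded where a real system would call a multimodal model on a screenshot.

```python
from typing import Callable


class ElementNotFound(Exception):
    """Raised when the DOM strategy cannot resolve a target."""


Strategy = Callable[[str], str]


def make_dom_strategy(refs_by_label: dict[str, str]) -> Strategy:
    """Build a strategy that resolves a label to a snapshot ref (stub)."""
    def resolve(label: str) -> str:
        if label not in refs_by_label:
            raise ElementNotFound(label)
        return f"click ref={refs_by_label[label]}"
    return resolve


def stub_vision_strategy(label: str) -> str:
    """Placeholder for a screenshot-based locator; coordinates are made up."""
    return "click at (210, 260)"


def plan_action(label: str, dom: Strategy, vision: Strategy) -> tuple[str, str]:
    """Try the DOM strategy first and fall back to vision only if it fails."""
    try:
        return ("dom", dom(label))
    except ElementNotFound:
        return ("vision", vision(label))


if __name__ == "__main__":
    dom = make_dom_strategy({"About": "e47", "Store": "e48"})
    print(plan_action("About", dom, stub_vision_strategy))
    print(plan_action("Chart legend", dom, stub_vision_strategy))
```

Returning the name of the strategy that was used alongside the action makes it easy to log how often the fallback fires, which is a useful signal for where DOM coverage is weak.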

OpenAI’s move to integrate both approaches into ChatGPT’s new Agent Mode reflects this. These hybrid agents can choose the best method for each step, making automation much more reliable than Operator was. Models like Claude 4 and open-source options are getting better at visual grounding and faster perception each month. As multimodal architectures improve, pure vision agents might someday match the speed and accuracy needed for mainstream use.

Right now, most enterprise-grade systems rely on a mix—using DOM reasoning for structured parts, vision for complex visuals, and scripting to ensure consistency. It’s not about a single perfect model, but about orchestrating multiple techniques smoothly. The future of browser agents isn’t just about vision or structure alone; it’s about how well these methods can work together.

Looking ahead, the next big challenge is making browser agents capable not just of completing tasks, but of learning from experience. Right now, one successful run doesn’t guarantee that the agent can repeat the process reliably. The goal is for agents to explore workflows visually, then encode what they learn into reusable scripts. This is a two-step process: first the agent discovers how a page is structured and records the navigation paths that succeed, then it converts those paths into automation scripts using tools like Playwright or Selenium.
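One way to picture the second step is a small generator that turns a recorded path into a Playwright script. This is a sketch under assumptions: the Step record, its three action types, and the example.com login flow are all hypothetical, and a real recorder would need waits, assertions, and error handling. The emitted script uses Playwright's Python sync API (page.goto, get_by_role, get_by_label).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded action from a successful run (hypothetical format)."""
    action: str            # "goto", "click", or "fill"
    target: str            # URL for goto, accessible name or label otherwise
    role: str = "button"   # ARIA role, used only by "click"
    value: str = ""        # text to type, used only by "fill"


def to_playwright_script(steps: list[Step]) -> str:
    """Render recorded steps as the source of a Playwright (Python) script."""
    lines = [
        "from playwright.sync_api import sync_playwright",
        "",
        "with sync_playwright() as p:",
        "    browser = p.chromium.launch()",
        "    page = browser.new_page()",
    ]
    for step in steps:
        if step.action == "goto":
            lines.append(f"    page.goto({step.target!r})")
        elif step.action == "click":
            lines.append(
                f"    page.get_by_role({step.role!r}, name={step.target!r}).click()"
            )
        elif step.action == "fill":
            lines.append(
                f"    page.get_by_label({step.target!r}).fill({step.value!r})"
            )
        else:
            raise ValueError(f"unsupported action: {step.action}")
    lines.append("    browser.close()")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    recorded = [
        Step("goto", "https://example.com/login"),
        Step("fill", "Email", value="user@example.com"),
        Step("click", "Sign in"),
    ]
    print(to_playwright_script(recorded))
```

Because the script targets elements by role and label rather than by coordinates, a replay does not need a model in the loop at all, which is where the speed and consistency gains come from.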

With the rise of large language models capable of writing and editing code, these agents can self-improve over time. They can generate better scripts after each attempt, becoming faster and more reliable. Over time, they’ll resemble skilled workers—initially slower but eventually able to repeat tasks effortlessly. This hybrid approach, combining visual understanding, structural reasoning, and code synthesis, is making browser automation more robust and adaptable.

Ultimately, the future isn’t about one single type of model. Instead, it’s about integrating multiple techniques—vision, structure, and code—to build smarter, more reliable browser agents. This evolving orchestration promises automation systems that learn, adapt, and become increasingly capable of handling complex, real-world tasks with minimal human intervention.
