How to Build 'Visual Agents' That Browse the Web Like a Human

We are witnessing the most significant shift in web automation since the invention of Selenium in 2004. For two decades, automation engineers have been fighting a losing war against the DOM (Document Object Model).

If you have ever written a bot, you know the struggle: You inspect the HTML, you find a button with id="submit-btn", and you write your script. The next day, the website updates its React frontend, the ID changes to id="submit-btn-v2-X7z", and your production pipeline crashes.

In 2026, the DOM is dead. Long live "Computer Use".

We have entered the era of Visual Action Models (VAMs). Instead of reading code, we give AI "eyes." We take a screenshot of the browser, feed it to a Multimodal LLM (like GPT-4o or DeepSeek-VL), and ask: "Where is the Login button?". The AI returns the X/Y coordinates, and we click it.

This course is your blueprint to building these agents. We will cover the theory, the "Set-of-Mark" prompting technique, the code, and the anti-detection strategies required to automate the un-automatable.


Course Curriculum


Module 1: The Architecture of a Visual Agent

A Visual Agent is not a script; it is a generic reasoning loop. It mimics the human cognitive cycle. Understanding this loop is crucial before writing code.

The "Observe-Think-Act" Loop
  1. Perception (The Eyes): The agent captures the current state of the browser. In 2026, this is not just HTML; it is a high-resolution screenshot + the Accessibility Tree (a simplified version of the HTML).
  2. Cognition (The Brain): The Vision LLM analyzes the screenshot. It maps the user's instruction ("Buy the cheapest laptop") to the visual elements on the screen.
  3. Grounding (The Map): The LLM outputs precise coordinates (e.g., click(450, 800)). This is called "Visual Grounding."
  4. Action (The Hands): The browser controller (Playwright) executes the click, type, or scroll command.

This loop repeats until the goal is achieved. Unlike DOM bots, this agent is resilient. If the "Login" button moves to the left, the agent sees it and clicks left. No code changes required.
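
As a sketch, the whole architecture fits in a few lines of Python. The observe, think, and act callables below are hypothetical placeholders; Module 4 builds the real versions with Playwright and the OpenAI API.

def agent_loop(page, objective, observe, think, act, max_steps=15):
    """The Observe-Think-Act loop in miniature; the three callables
    are supplied by the caller (Module 4 fills them in)."""
    for _ in range(max_steps):
        screenshot = observe(page)               # 1. Perception: capture state
        decision = think(screenshot, objective)  # 2+3. Cognition and Grounding
        if decision["action"] == "done":         # Goal achieved, stop looping
            return decision
        act(page, decision)                      # 4. Action: click/type/scroll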


Module 2: The Tech Stack (DeepSeek vs. OpenAI)

You need to choose your "Brain" carefully. Visual automation is token-heavy and becomes expensive if poorly optimized. All three options below can be driven through the same OpenAI-compatible client, as sketched after the list.

Option A: The Premium Brain (GPT-4o)
  • Pros: Highest accuracy in reading text from images. Best at complex reasoning.
  • Cons: Expensive (~$0.01 per step). Higher latency.
  • Use Case: Complex logic (e.g., "Find the laptop with the best CPU/Price ratio").
Option B: The Efficiency Brain (DeepSeek-VL-V2)
  • Pros: Very low API costs (roughly a tenth of OpenAI's). Open weights available for local hosting (Privacy).
  • Cons: Slightly higher hallucination rate on small text.
  • Use Case: High-volume scraping (e.g., "Click the 'Next Page' button 1,000 times").
Option C: The Local Brain (Llava / Qwen-VL)
  • Pros: Free (Running on your GPU via Ollama). No privacy leaks.
  • Cons: Requires powerful hardware (RTX 4090 recommended).
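
Swapping "Brains" can be a one-line change. The sketch below assumes DeepSeek's hosted endpoint and a local Ollama server, both of which expose OpenAI-compatible APIs; the exact model names you pass in later calls vary per provider.

from openai import OpenAI

BRAINS = {
    "premium":    OpenAI(api_key="YOUR_OPENAI_KEY"),
    "efficiency": OpenAI(api_key="YOUR_DEEPSEEK_KEY",
                         base_url="https://api.deepseek.com"),
    "local":      OpenAI(api_key="ollama",  # Ollama ignores the key
                         base_url="http://localhost:11434/v1"),
}

client = BRAINS["efficiency"]  # the rest of the agent code stays identical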

Module 3: The "Set-of-Mark" (SoM) Technique

This is the secret sauce. If you simply send a screenshot to GPT-4o and ask "Where is the button?", it often guesses the coordinates wrong. This is called "Spatial Hallucination."

To fix this, researchers at Microsoft developed Set-of-Mark (SoM) Prompting.

How SoM Works:
  1. Before sending the screenshot to the AI, we run a JavaScript snippet to find all interactive elements (buttons, links, inputs).
  2. We draw a Bounding Box with a numeric ID over each element on the screenshot.
  3. We send the marked image to the AI.
  4. We ask: "What is the ID of the Login Button?"
  5. The AI answers: "42".
  6. We look up the coordinates of Box #42 in our lookup table and click it.

In practice, this can cut grounding error rates from roughly 20% down to the low single digits.
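
Here is a minimal sketch of the marking step using Playwright and Pillow. The interactive-element selector and the red-box styling are simplifying assumptions; production implementations usually walk the Accessibility Tree instead of a CSS query.

from PIL import Image, ImageDraw

INTERACTIVE = "a, button, input, select, textarea, [role='button']"

def mark_screenshot(page, raw_path="raw.png", marked_path="marked.png"):
    """Screenshot the page, draw a numbered box over each interactive
    element, and return {id: (center_x, center_y)} for later clicks."""
    page.screenshot(path=raw_path)
    image = Image.open(raw_path)
    draw = ImageDraw.Draw(image)
    id_to_center = {}
    for idx, element in enumerate(page.query_selector_all(INTERACTIVE)):
        box = element.bounding_box()  # None if the element is not visible
        if box is None:
            continue
        x, y, w, h = box["x"], box["y"], box["width"], box["height"]
        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
        draw.text((x + 2, y + 2), str(idx), fill="red")
        id_to_center[idx] = (x + w / 2, y + h / 2)
    image.save(marked_path)
    return id_to_center

# Later, when the AI answers "42":  page.mouse.click(*id_to_center[42])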


Module 4: Building the Engine (Python Code)

Let's write a production-grade agent using Python, Playwright, and the OpenAI API. We will implement a simplified version of the "SoM" technique.

Step 1: Install Dependencies
pip install playwright openai pillow
playwright install chromium  # downloads the browser binary Playwright drives
Step 2: The "Vision" Controller Script

import base64
import json
import time
from playwright.sync_api import sync_playwright
from openai import OpenAI

# Initialize Client
client = OpenAI(api_key="YOUR_OPENAI_KEY")

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def get_next_action(screenshot_path, objective, url):
    """
    Sends the screenshot to the VLM (Vision Language Model) to decide the next move.
    """
    base64_image = encode_image(screenshot_path)
    
    prompt = f"""
    You are a browser automation agent.
    Objective: {objective}
    Current URL: {url}
    
    Analyze the screenshot. Identify the interactive element that brings us closer to the objective.
    Return JSON format:
    {{
        "action": "click" | "type" | "scroll" | "done",
        "reasoning": "I see the search bar, so I will click it.",
        "location": [x, y],  // Approximate center of the element
        "text_value": "search query" // Only for type action
    }}
    """
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
            ]}
        ],
        response_format={"type": "json_object"},
        max_tokens=500
    )
    
    return json.loads(response.choices[0].message.content)

def run_agent(start_url, objective):
    with sync_playwright() as p:
        # Launch browser (Headless=False to see the magic)
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(viewport={'width': 1280, 'height': 800})
        page = context.new_page()
        page.goto(start_url)
        
        step_count = 0
        max_steps = 15
        
        while step_count < max_steps:
            print(f"--- Step {step_count} ---")
            
            # 1. Take Screenshot
            screenshot_path = f"step_{step_count}.jpg"
            page.screenshot(path=screenshot_path)
            
            # 2. Ask AI
            try:
                decision = get_next_action(screenshot_path, objective, page.url)
                print(f"AI Decision: {decision['reasoning']}")
                
                # 3. Execute Action
                if decision['action'] == "click":
                    x, y = decision['location']
                    page.mouse.click(x, y)
                    
                elif decision['action'] == "type":
                    x, y = decision['location']
                    page.mouse.click(x, y) # Click to focus first
                    page.keyboard.type(decision['text_value'])
                    page.keyboard.press("Enter")
                    
                elif decision['action'] == "scroll":
                    page.mouse.wheel(0, 500)
                    
                elif decision['action'] == "done":
                    print("Objective Achieved!")
                    break
                
                # Wait for page load (Important!)
                time.sleep(3) 
                step_count += 1
                
            except Exception as e:
                print(f"Error: {e}")
                break
                
        browser.close()

# Run the bot
if __name__ == "__main__":
    run_agent("https://amazon.com", "Search for 'Gaming Laptop' and click the first result.")


Module 5: Handling "Anti-Bot" Defense (Stealth)

Anti-bot services like Cloudflare, Akamai, and DataDome are designed to detect automated browsers. If you use raw Playwright, you will get blocked. You must use Stealth Techniques.

1. The "Ghost Cursor" Strategy

Robots move the mouse in straight lines (A to B). Humans move in curves (Bezier curves) and overshoot targets.
Solution: Use the Ghost Cursor library. It adds micro-jitters and realistic pathing to your mouse movements.
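
Ghost Cursor itself is a Node.js library (a Python port exists), but the core idea is easy to approximate in raw Playwright. The curve and jitter parameters below are illustrative assumptions, not the library's actual algorithm.

import random

def human_move(page, start, end, steps=30):
    """Move along a quadratic Bezier curve with per-step jitter,
    instead of Playwright's default straight-line movement."""
    (x0, y0), (x2, y2) = start, end
    # A random control point bends the path so no two moves look identical
    cx = (x0 + x2) / 2 + random.uniform(-100, 100)
    cy = (y0 + y2) / 2 + random.uniform(-100, 100)
    for i in range(1, steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y2
        page.mouse.move(x + random.uniform(-1, 1), y + random.uniform(-1, 1))

# Usage: human_move(page, (0, 0), (450, 800)) before page.mouse.click(450, 800)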

2. Browser Fingerprinting

Sites check your "User Agent" and "Canvas Fingerprint."
Solution: Use playwright-stealth plugin:
from playwright_stealth import stealth_sync
stealth_sync(page)
This injects scripts that hide the fact that the browser is being controlled by automation software.

3. The "CDP" Evasion

Detection scripts also probe for automation artifacts, most famously the navigator.webdriver flag. Ensure your launch arguments include:
args=["--disable-blink-features=AutomationControlled"]
This suppresses the Blink-level automation signals that many detectors check first.
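
In the context of the Module 4 script, the launch call becomes:

browser = p.chromium.launch(
    headless=False,
    args=["--disable-blink-features=AutomationControlled"],  # hides navigator.webdriver
)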


Module 6: Optimization & Cost Management

Visual Agents are expensive. Sending a 1080p image to GPT-4o every 3 seconds burns tokens. Here is how to reduce costs by up to 90%.

1. Image Resizing & Grayscale

The AI does not need 4K color to find a button.
Technique: Resize screenshots to 720p before sending them to the API; vision pricing scales with image dimensions (OpenAI bills high-detail images per 512-pixel tile), so downscaling directly cuts billed tokens. Converting to grayscale also shrinks the upload.
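
A small preprocessing helper, assuming Pillow (installed in Module 4) and the JPEG screenshots the Module 4 script already writes. Call it on screenshot_path right after page.screenshot() in the loop:

from PIL import Image

def shrink_screenshot(path, max_width=1280):
    """Downscale and grayscale a screenshot in place before it is
    base64-encoded; smaller images mean fewer billed image tokens."""
    image = Image.open(path)
    if image.width > max_width:
        ratio = max_width / image.width
        image = image.resize((max_width, int(image.height * ratio)))
    image.convert("L").save(path, quality=80)  # "L" = 8-bit grayscale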

2. Caching via Visual Diffs

If the page hasn't changed visually, do not call the API again. Compare the current screenshot's hash with the previous one.
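
A byte-level hash is the simplest version of this check. It is strict (any changed pixel busts the cache); a perceptual hash, such as the imagehash library provides, tolerates minor rendering noise.

import hashlib

def screenshot_hash(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Inside the Module 4 loop, before calling get_next_action:
# current = screenshot_hash(screenshot_path)
# if current == previous_hash:
#     page.mouse.wheel(0, 500)  # nudge the page instead of re-asking the LLM
# previous_hash = current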

3. Hybrid Approach

Use "Blind Mode" for simple things. If you know the URL is `google.com`, hardcode the search box interaction. Only switch to "Visual Mode" when something breaks or is dynamic.


Module 7: The "Browser-Use" Library (The Shortcut)

If you don't want to build the engine yourself, use the open-source library that took 2026 by storm: Browser-Use.

Browser-Use is an open-source library that wires LangChain chat models into Playwright automatically. It includes self-correction: if the bot misses a click, it retries.


import asyncio

from langchain_openai import ChatOpenAI
from browser_use import Agent

async def main():
    # It handles everything: Vision, Clicking, Scrolling, Context
    agent = Agent(
        task="Go to Reddit, search for 'AI Agents', and print the top post title.",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    result = await agent.run()
    print(result)

asyncio.run(main())


Conclusion: The "Infinite" API

The internet was built for humans. Now that computers can "see" like humans, the internet is effectively one giant API.

With Visual Agents, you can build price trackers for sites that block scrapers, automate government filings, or create competitive intelligence bots. The barrier to entry is no longer coding skill; it is imagination.

Ready to sell your agent? Check out our guide on How to Productize Your Python Scripts on Gumroad.