🌐 Browser Agent

An AI-powered automation tool that controls web browsers through natural language commands

AI Browser Navigation Demo

Click the image above to watch the demo video

📋 Table of Contents

Overview
Features
Quick Start
Usage Examples
Configuration
Browser Agent Tools
Architecture
Project Structure
Technical Details
Error Handling
Advanced Usage
Troubleshooting
Use Cases
Limitations
Roadmap
Contributing
License

🔍 Overview

Browser Agent is a sophisticated Python application that enables users to control web browsers through natural language instructions. It uses a combination of AI models, browser automation, and DOM analysis to navigate websites, fill forms, click buttons, and perform complex web tasks based on natural language descriptions.

The agent features intelligent page analysis, robust error handling, and flexible interaction capabilities, making it ideal for automating repetitive browser tasks, web scraping, site testing, and interactive browsing sessions.

Why use Browser Agent?

Simplify Web Automation - No more complex automation scripts or browser extensions
Reduce Learning Curve - Use natural language instead of programming syntax
Improve Productivity - Automate repetitive web tasks with minimal effort
Enhance Accessibility - Enable browser control for users with limited technical knowledge
Rapid Prototyping - Quickly test and iterate on web workflows

✨ Features

🗣️ Natural Language Control: Control your browser with simple human language instructions
🔍 Intelligent Page Analysis: Automatic detection and mapping of interactive elements on web pages
🧭 Context-Aware Navigation: Smart navigation with history tracking and state awareness
📝 Form Handling: Fill forms, select options from dropdowns, and submit data seamlessly
⚡ Dynamic Content Support: Handle AJAX, infinite scrolling, popups, and dynamically loaded content
🔄 Error Recovery: Robust error detection and recovery strategies
👤 User Interaction: Request information from the user during task execution when needed
🌍 Multi-Browser Support: Connect to existing Chrome browsers or launch new instances

🚀 Quick Start

Prerequisites

Python 3.11 or higher
uv (Python package manager)
Chrome browser (for existing browser connection)

Step-by-Step Installation

Clone the repository

git clone https://github.com/yourusername/browser-agent.git
cd browser-agent

Install Playwright browsers
```
playwright install
```

Set up environment variables

Create a .env file in the project root:

OPENAI_API_KEY=your_api_key_here
AZURE_ENDPOINT=your_azure_endpoint  # If using Azure OpenAI

Or set them directly in your terminal:

export OPENAI_API_KEY=your_api_key_here
export AZURE_ENDPOINT=your_azure_endpoint

Run the Browser Agent
```
uv run main.py
```

🎮 Usage Examples

Basic Commands

# Start a new browser session with the agent
uv run main.py run

# Launch with a specific task
uv run main.py run --task "Go to example.com and click the signup button"

# Run in headless mode (no visible browser window)
uv main.py run --headless

# Launch Chrome with remote debugging enabled
uv main.py launch --port 9222

# Run with verbose debug logging
uv main.py debug

Interactive Examples

When the Browser Agent is running, you can provide natural language instructions of medium complexity that it will execute efficiently:

# Research and Summarize
Enter your instruction: Go to Wikipedia, search for "artificial intelligence ethics", find the main concerns section, and summarize the key points

# Online Shopping Assistant
Enter your instruction: Search for a mid-range laptop on Amazon with at least 16GB RAM, buy the best valued and rated laptop.

# News Aggregation
Enter your instruction: Visit three major news sites, find articles about climate change from the past week, and create a summary of the main developments

# Recipe Finder
Enter your instruction: Find a chicken curry recipe that takes less than 30 minutes to prepare, has good reviews, and doesn't require specialty ingredients

# Product Comparison
Enter your instruction: Compare the features and prices of the latest iPhone and Samsung Galaxy models, focusing on camera quality and battery life

# Event Planning
Enter your instruction: Search for outdoor concerts in San Francisco next month, check the weather forecast for those dates, and recommend the best weekend to attend

# Job Search
Enter your instruction: Find software developer jobs in Boston that allow remote work, require Python experience, and were posted in the last week

# Email Writing
Enter your instruction: Write an email based on the extracted data or task completed. The email should be clear, professional, and suitable to send to a colleague or manager. Include a brief summary of what was done, any important findings, and next steps if applicable.

These examples show how Browser Agent can handle tasks that involve multiple steps across one or more websites, gathering specific information based on criteria, and providing summarized results.

⚙️ Configuration

Environment Variables

Create a .env file with the following variables:

OPENAI_API_KEY=your_api_key_here
AZURE_ENDPOINT=your_azure_endpoint  # If using Azure OpenAI

Browser Configuration

Configure browser settings in configurations/config.py:

BROWSER_OPTIONS = {
    "headless": False,         # Run browser headlessly or with UI
    "slowmo": 0,               # Slow down actions by milliseconds (for debugging)
    "timeout": 30000,          # Default timeout in milliseconds
    "viewport": {              # Browser window size
        "width": 1280,
        "height": 800
    }
}

BROWSER_CONNECTION = {
    "use_existing": False,     # Connect to existing Chrome instance
    "cdp_endpoint": None,      # Custom CDP endpoint (e.g., "http://localhost:9222")
    "fallback_to_new": True,   # Fallback to launching new browser if connection fails
}

Advanced Configuration Options

Agent Configuration

Modify agent/agent.py to configure the LLM settings:

# LLM Model settings
llm = AzureChatOpenAI(
    model_name="gpt-4o",           # Model to use
    openai_api_key=api_key,        # API key
    temperature=0,                 # Deterministic output (0) to creative (1)
    api_version="2024-12-01-preview",
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
)

🛠️ Browser Agent Tools

The agent uses these specialized tools to control the browser:

Tool	Description	Example
analyze_page	Scans the DOM to create a map of all visible elements with numbered IDs	`analyze_page()`
click	Clicks on elements using IDs from the page analysis	`click("[3][button]Submit")`
fill_input	Fills text into input fields	`fill_input('{"id":"5","type":"input","value":"example text"}')`
select_option	Selects options from dropdown menus	`select_option('{"id":"8","value":"Option 2"}')`
keyboard_action	Sends keyboard shortcuts and special keys	`keyboard_action("Enter")`
navigate	Opens a URL in the browser	`navigate("https://example.com")`
go_back	Navigates to the previous page in browser history	`go_back()`
scroll	Scrolls the viewport in specified directions	`scroll("down")`
ask_user	Requests information from the user during task execution	`ask_user('{"prompt":"Enter password","type":"password"}')`

How Element References Work

When the agent analyzes a page, it assigns numeric IDs to each interactive element:

[1][button]Sign Up
[2][link]Learn More
[3][input]Search

The agent can then reference these elements by their ID, type, and text:

click("[1][button]Sign Up")

🏗️ Architecture

The Browser Agent is structured in a modular fashion:

Core Components

Agent Layer: Manages AI interactions using LangChain and LangGraph
Browser Layer: Controls web browsers via Playwright
CLI Layer: Provides command-line interface and configuration
Tools Layer: Implements specialized browser interaction capabilities

Data Flow

User provides natural language instruction
Agent processes instruction and plans actions
Agent executes actions using browser tools
Browser interacts with web pages
Page analyzer extracts information from DOM
Agent interprets results and plans next actions
Agent provides results back to user

📂 Project Structure

browser_agent/
├── agent/
│   └── agent.py             # AI agent implementation
├── browser/
│   ├── analyzers/
│   │   └── page_analyzer.py # DOM analysis tools
│   ├── controllers/
│   │   ├── browser_controller.py  # Main browser interface
│   │   ├── element_controller.py  # Element interaction
│   │   └── keyboard_controller.py # Keyboard actions
│   ├── navigation/
│   │   ├── navigator.py      # URL and history navigation
│   │   └── scroll_manager.py # Scrolling capabilities
│   ├── utils/
│   │   ├── dom_helpers.py     # DOM manipulation helpers
│   │   ├── input_helpers.py   # Input processing utilities
│   │   └── user_interaction.py # User interaction tools
│   └── browser_setup.py      # Browser initialization
├── cli/
│   ├── chrome_launcher.py    # Chrome debugging launcher
│   └── commands.py           # CLI command definitions
├── configurations/
│   └── config.py             # Configuration settings
├── main.py                   # Application entry point
└── requirements.txt          # Dependencies

🔧 Technical Details

AI Agent Framework

The Browser Agent is built on LangGraph, a stateful workflow framework for LLM applications. The agent uses a GPT model from Azure OpenAI with specialized tools for browser control.

Agent Workflow

Instruction Processing: The LLM parses the natural language instruction
Tool Selection: The agent decides which browser tools to use
Action Execution: The agent executes actions sequentially
Progress Monitoring: The agent tracks state changes and adapts
Result Generation: The agent compiles results for the user

Browser Automation

The system uses Playwright for browser automation, providing cross-browser compatibility and powerful DOM manipulation capabilities.

Element Detection

The Page Analyzer uses advanced JavaScript to scan the DOM and identify interactive elements, creating a numbered map of elements that the agent can reference with precise IDs.

🚨 Error Handling

The Browser Agent employs sophisticated error handling:

Element Detection: Re-analysis after scrolling or waiting for dynamic content
Click Failures: Alternative selection strategies and visibility checks
Navigation Issues: Error detection and recovery for failed navigations
Form Validation: Reading and addressing validation errors
Session Management: Detection and handling of timeouts and authentication issues

🔍 Advanced Usage

Connecting to Existing Chrome Instances

You can connect the Browser Agent to an already running Chrome instance:

Launch Chrome with remote debugging enabled:
```
python main.py launch --port 9222
```

Configure the Browser Agent to use the existing instance:

# In configurations/config.py
BROWSER_CONNECTION = {
    "use_existing": True,
    "cdp_endpoint": "http://localhost:9222",
}

Run the Browser Agent:
```
python main.py run
```

Creating Multi-Step Automated Workflows

You can chain multiple instructions into a single workflow:

python main.py run --task "Go to gmail.com, wait for the login page,
enter the username 'test@example.com', click next, wait for the password field,
ask me for the password, enter it, and click sign in"

User Interaction During Execution

The agent can prompt for user input during execution:

# Example of how the agent uses the ask_user tool
ask_user('{"prompt":"Please enter your 2FA code","type":"text"}')

🔍 Troubleshooting

Browser Connection Issues

Problem: Unable to connect to Chrome with debugging enabled

Solutions:

Ensure Chrome is not already running with the same debugging port
Try a different port: python main.py launch --port 9223
Check firewall settings that might block the connection
Verify Chrome is installed in the default location or set the path manually

Page Analysis Problems

Problem: Agent can't find or interact with elements on the page

Solutions:

Give the page more time to load completely before analyzing
For dynamic content, ask the agent to scroll and re-analyze
Make your element references more specific
For highly dynamic sites, try slowing down agent actions with the slowmo option

Authentication Challenges

Problem: Agent can't handle login procedures with CAPTCHA or 2FA

Solutions:

Pre-authenticate in the browser before connecting the agent
Use the ask_user tool to get manual input for verification challenges
Consider using cookies or saved sessions for sites you access frequently

💼 Use Cases

Web Automation: Automate repetitive web tasks with natural language instructions
```
Fill out the same form on 20 different websites with my business information
```
Site Testing: Test web applications with natural language test cases
```
Visit our app, try to create a new account, and report any errors
```

Data Collection: Extract information from websites

Go to the weather website and collect the 5-day forecast for New York, Chicago, and Los Angeles

Interactive Assistance: Guide users through complex web processes

Help me book a flight from New York to San Francisco for next Friday, returning Sunday

Prototyping: Quickly test web workflows without writing code

Try our new checkout process with different payment methods and report the experience

⚠️ Limitations

Cannot handle CAPTCHA or authentication challenges requiring human verification
May face challenges with highly dynamic web apps that change rapidly
Not designed for high-security operations (banking, etc.)
Performance depends on the complexity of the website and instructions
Cannot interact with elements that require hover actions only (without clickable alternatives)
May struggle with websites that heavily use canvas or WebGL for rendering

🛣️ Roadmap

Multi-session Support: Run multiple browser sessions simultaneously
Screenshot Capabilities: Capture and analyze visual elements
PDF Processing: Extract information from PDFs displayed in browsers
Enhanced Error Recovery: More sophisticated recovery strategies
User Interface: Add a web-based UI for easier interaction
Action Recording: Record and replay browser sessions
Customizable Agents: User-defined agent personalities and capabilities

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Made with ❤️ by rkvalandasu

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
agent		agent
browser		browser
cli		cli
configurations		configurations
images		images
.DS_Store		.DS_Store
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
Readme.md		Readme.md
main.py		main.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
uv.lock		uv.lock

License

rkvalandas/browser_agent

Folders and files

Latest commit

History

Repository files navigation

🌐 Browser Agent

AI Browser Navigation Demo

📋 Table of Contents

🔍 Overview

✨ Features

🚀 Quick Start

Prerequisites

Step-by-Step Installation

🎮 Usage Examples

Basic Commands

Interactive Examples

⚙️ Configuration

Environment Variables

Browser Configuration

Advanced Configuration Options

🛠️ Browser Agent Tools

How Element References Work

🏗️ Architecture

Core Components

Data Flow

📂 Project Structure

🔧 Technical Details

AI Agent Framework

Browser Automation

Element Detection

🚨 Error Handling

🔍 Advanced Usage

Connecting to Existing Chrome Instances

Creating Multi-Step Automated Workflows

User Interaction During Execution

🔍 Troubleshooting

💼 Use Cases

⚠️ Limitations

🛣️ Roadmap

🤝 Contributing

📄 License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages