An AI-powered automation tool that controls web browsers through natural language commands
- Overview
- Features
- Quick Start
- Usage Examples
- Configuration
- Browser Agent Tools
- Architecture
- Project Structure
- Technical Details
- Error Handling
- Advanced Usage
- Troubleshooting
- Use Cases
- Limitations
- Roadmap
- Contributing
- License
Browser Agent is a Python application that lets users control a web browser with natural language instructions. It combines AI models, browser automation, and DOM analysis to navigate websites, fill forms, click buttons, and carry out multi-step web tasks from a plain-language description.
The agent provides page analysis, error recovery, and the ability to ask the user for input mid-task, which makes it well suited to automating repetitive browser tasks, web scraping, site testing, and interactive browsing sessions.
Why use Browser Agent?
- Simplify Web Automation - No more complex automation scripts or browser extensions
- Reduce Learning Curve - Use natural language instead of programming syntax
- Improve Productivity - Automate repetitive web tasks with minimal effort
- Enhance Accessibility - Enable browser control for users with limited technical knowledge
- Rapid Prototyping - Quickly test and iterate on web workflows
- 🗣️ Natural Language Control: Control your browser with simple human language instructions
- 🔍 Intelligent Page Analysis: Automatic detection and mapping of interactive elements on web pages
- 🧭 Context-Aware Navigation: Smart navigation with history tracking and state awareness
- 📝 Form Handling: Fill forms, select options from dropdowns, and submit data seamlessly
- ⚡ Dynamic Content Support: Handle AJAX, infinite scrolling, popups, and dynamically loaded content
- 🔄 Error Recovery: Robust error detection and recovery strategies
- 👤 User Interaction: Request information from the user during task execution when needed
- 🌍 Multi-Browser Support: Connect to existing Chrome browsers or launch new instances
Prerequisites
- Python 3.11 or higher
- uv (Python package manager)
- Chrome browser (for existing browser connection)
- Clone the repository:
    git clone https://github.com/yourusername/browser-agent.git
    cd browser-agent
- Install Playwright browsers:
    playwright install
- Set up environment variables. Create a .env file in the project root:
    OPENAI_API_KEY=your_api_key_here
    AZURE_ENDPOINT=your_azure_endpoint  # If using Azure OpenAI
  Or set them directly in your terminal:
    export OPENAI_API_KEY=your_api_key_here
    export AZURE_ENDPOINT=your_azure_endpoint
- Run the Browser Agent:
    uv run main.py
# Start a new browser session with the agent
uv run main.py run
# Launch with a specific task
uv run main.py run --task "Go to example.com and click the signup button"
# Run in headless mode (no visible browser window)
uv run main.py run --headless
# Launch Chrome with remote debugging enabled
uv run main.py launch --port 9222
# Run with verbose debug logging
uv run main.py debug
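The subcommands above could be wired up roughly as follows. This is a minimal sketch using the standard library's argparse; the project's real cli/commands.py may use a different CLI library, and the function name `build_parser` and the default port of 9222 are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the run / launch / debug commands (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command")

    run = sub.add_parser("run", help="Start a browser session with the agent")
    run.add_argument("--task", default=None, help="Instruction to execute immediately")
    run.add_argument("--headless", action="store_true", help="Hide the browser window")

    launch = sub.add_parser("launch", help="Launch Chrome with remote debugging")
    launch.add_argument("--port", type=int, default=9222)  # assumed default

    sub.add_parser("debug", help="Run with verbose debug logging")
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["run", "--task", "Go to example.com", "--headless"])
print(args.command, args.task, args.headless)
```

With no subcommand, `args.command` is `None`, which a caller could treat the same as `run` to match the bare `uv run main.py` invocation.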
While Browser Agent is running, you can give it moderately complex natural language instructions such as the following:
# Research and Summarize
Enter your instruction: Go to Wikipedia, search for "artificial intelligence ethics", find the main concerns section, and summarize the key points
# Online Shopping Assistant
Enter your instruction: Search Amazon for a mid-range laptop with at least 16GB RAM and buy the best-value, highest-rated one
# News Aggregation
Enter your instruction: Visit three major news sites, find articles about climate change from the past week, and create a summary of the main developments
# Recipe Finder
Enter your instruction: Find a chicken curry recipe that takes less than 30 minutes to prepare, has good reviews, and doesn't require specialty ingredients
# Product Comparison
Enter your instruction: Compare the features and prices of the latest iPhone and Samsung Galaxy models, focusing on camera quality and battery life
# Event Planning
Enter your instruction: Search for outdoor concerts in San Francisco next month, check the weather forecast for those dates, and recommend the best weekend to attend
# Job Search
Enter your instruction: Find software developer jobs in Boston that allow remote work, require Python experience, and were posted in the last week
# Email Writing
Enter your instruction: Write an email based on the extracted data or task completed. The email should be clear, professional, and suitable to send to a colleague or manager. Include a brief summary of what was done, any important findings, and next steps if applicable.
These examples show how Browser Agent can handle tasks that involve multiple steps across one or more websites, gathering specific information based on criteria, and providing summarized results.
Create a .env file with the following variables:
OPENAI_API_KEY=your_api_key_here
AZURE_ENDPOINT=your_azure_endpoint # If using Azure OpenAI
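A startup check can catch a missing variable before the first LLM call fails. The sketch below is not taken from the project; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and treating AZURE_ENDPOINT as required is an assumption that applies only when Azure OpenAI is used.

```python
import os

# Variable names come from the .env example above.
REQUIRED_VARS = ("OPENAI_API_KEY", "AZURE_ENDPOINT")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Passing a plain dict as `environ` makes the check easy to exercise without touching the real environment.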
Configure browser settings in configurations/config.py:
BROWSER_OPTIONS = {
    "headless": False,  # Run browser headlessly or with UI
    "slowmo": 0,        # Slow down actions by milliseconds (for debugging)
    "timeout": 30000,   # Default timeout in milliseconds
    "viewport": {       # Browser window size
        "width": 1280,
        "height": 800
    }
}
BROWSER_CONNECTION = {
    "use_existing": False,    # Connect to existing Chrome instance
    "cdp_endpoint": None,     # Custom CDP endpoint (e.g., "http://localhost:9222")
    "fallback_to_new": True,  # Fallback to launching new browser if connection fails
}
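To show where each setting might end up, the sketch below splits BROWSER_OPTIONS into the argument groups that Playwright's Python API accepts. The helper name `split_browser_options` is hypothetical, and how browser_setup.py actually consumes the config is not shown in this README, so treat the mapping as an assumption.

```python
BROWSER_OPTIONS = {
    "headless": False,
    "slowmo": 0,
    "timeout": 30000,
    "viewport": {"width": 1280, "height": 800},
}


def split_browser_options(options):
    """Split the config into launch kwargs, context kwargs, and a default timeout.

    Intended use with Playwright (not executed here):
      browser = playwright.chromium.launch(**launch_kwargs)
      context = browser.new_context(**context_kwargs)
      page.set_default_timeout(default_timeout_ms)
    """
    # Playwright's Python keyword is "slow_mo"; the config key here is "slowmo".
    launch_kwargs = {"headless": options["headless"], "slow_mo": options["slowmo"]}
    context_kwargs = {"viewport": dict(options["viewport"])}
    default_timeout_ms = options["timeout"]
    return launch_kwargs, context_kwargs, default_timeout_ms


launch_kwargs, context_kwargs, default_timeout_ms = split_browser_options(BROWSER_OPTIONS)
print(launch_kwargs, context_kwargs, default_timeout_ms)
```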
Agent Configuration
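Modify agent/agent.py to configure the LLM settings:
import os
from langchain_openai import AzureChatOpenAI  # assumes the langchain-openai package

api_key = os.getenv("OPENAI_API_KEY")  # assumed to be read from the .env file above

# LLM Model settings
llm = AzureChatOpenAI(
    model_name="gpt-4o",          # Model to use
    openai_api_key=api_key,       # API key
    temperature=0,                # Deterministic output (0) to creative (1)
    api_version="2024-12-01-preview",
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
)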
The agent uses these specialized tools to control the browser:
Tool | Description | Example |
---|---|---|
analyze_page | Scans the DOM to create a map of all visible elements with numbered IDs | analyze_page() |
click | Clicks on elements using IDs from the page analysis | click("[3][button]Submit") |
fill_input | Fills text into input fields | fill_input('{"id":"5","type":"input","value":"example text"}') |
select_option | Selects options from dropdown menus | select_option('{"id":"8","value":"Option 2"}') |
keyboard_action | Sends keyboard shortcuts and special keys | keyboard_action("Enter") |
navigate | Opens a URL in the browser | navigate("https://example.com") |
go_back | Navigates to the previous page in browser history | go_back() |
scroll | Scrolls the viewport in specified directions | scroll("down") |
ask_user | Requests information from the user during task execution | ask_user('{"prompt":"Enter password","type":"password"}') |
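Several tools in the table take a JSON string as their single argument. The sketch below shows one way such an argument could be validated before it reaches the browser; `parse_fill_input` is a hypothetical helper, and the rule that "id" and "value" are mandatory while "type" defaults to "input" is an assumption based on the example in the table.

```python
import json


def parse_fill_input(raw: str) -> dict:
    """Parse and validate a fill_input argument such as '{"id":"5","value":"text"}'."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool input is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("tool input must be a JSON object")
    missing = [key for key in ("id", "value") if key not in payload]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    return {
        "id": int(payload["id"]),              # IDs arrive as strings in the examples
        "type": payload.get("type", "input"),  # assumed default
        "value": str(payload["value"]),
    }


parsed = parse_fill_input('{"id":"5","type":"input","value":"example text"}')
print(parsed)
```

Raising ValueError with a readable message gives the LLM something concrete to correct on its next attempt.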
When the agent analyzes a page, it assigns a numeric ID to each interactive element:
[1][button]Sign Up
[2][link]Learn More
[3][input]Search
The agent can then reference these elements by their ID, type, and text:
click("[1][button]Sign Up")
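The reference format above is regular enough to parse with a single pattern. The following is a minimal sketch, assuming references always have the shape `[id][type]text`; `ElementRef` and `parse_element_ref` are illustrative names, not part of the project.

```python
import re
from typing import NamedTuple


class ElementRef(NamedTuple):
    id: int
    kind: str
    text: str


# [digits][type]optional text
_REF_PATTERN = re.compile(r"^\[(\d+)\]\[([^\]]+)\](.*)$")


def parse_element_ref(ref: str) -> ElementRef:
    """Split a reference like '[1][button]Sign Up' into its ID, type, and text."""
    match = _REF_PATTERN.match(ref.strip())
    if match is None:
        raise ValueError(f"not an element reference: {ref!r}")
    return ElementRef(int(match.group(1)), match.group(2), match.group(3).strip())


ref = parse_element_ref("[1][button]Sign Up")
print(ref.id, ref.kind, ref.text)
```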
The Browser Agent is structured in a modular fashion:
- Agent Layer: Manages AI interactions using LangChain and LangGraph
- Browser Layer: Controls web browsers via Playwright
- CLI Layer: Provides command-line interface and configuration
- Tools Layer: Implements specialized browser interaction capabilities
A request moves through these layers in the following order:
- User provides natural language instruction
- Agent processes instruction and plans actions
- Agent executes actions using browser tools
- Browser interacts with web pages
- Page analyzer extracts information from DOM
- Agent interprets results and plans next actions
- Agent provides results back to user
browser_agent/
├── agent/
│ └── agent.py # AI agent implementation
├── browser/
│ ├── analyzers/
│ │ └── page_analyzer.py # DOM analysis tools
│ ├── controllers/
│ │ ├── browser_controller.py # Main browser interface
│ │ ├── element_controller.py # Element interaction
│ │ └── keyboard_controller.py # Keyboard actions
│ ├── navigation/
│ │ ├── navigator.py # URL and history navigation
│ │ └── scroll_manager.py # Scrolling capabilities
│ ├── utils/
│ │ ├── dom_helpers.py # DOM manipulation helpers
│ │ ├── input_helpers.py # Input processing utilities
│ │ └── user_interaction.py # User interaction tools
│ └── browser_setup.py # Browser initialization
├── cli/
│ ├── chrome_launcher.py # Chrome debugging launcher
│ └── commands.py # CLI command definitions
├── configurations/
│ └── config.py # Configuration settings
├── main.py # Application entry point
└── requirements.txt # Dependencies
The Browser Agent is built on LangGraph, a stateful workflow framework for LLM applications. The agent uses a GPT model from Azure OpenAI with specialized tools for browser control.
Agent Workflow
- Instruction Processing: The LLM parses the natural language instruction
- Tool Selection: The agent decides which browser tools to use
- Action Execution: The agent executes actions sequentially
- Progress Monitoring: The agent tracks state changes and adapts
- Result Generation: The agent compiles results for the user
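The workflow above amounts to a plan, act, observe loop. The sketch below expresses that loop in plain Python with a scripted planner standing in for the LLM; it does not use LangGraph, and `run_agent`, `scripted_planner`, and the stub tools are all hypothetical.

```python
from typing import Callable, Optional

Action = tuple[str, str]  # (tool name, tool argument)


def run_agent(
    instruction: str,
    plan_next: Callable[[str, list], Optional[Action]],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> list:
    """Ask the planner for one action at a time until it returns None."""
    history = []
    for _ in range(max_steps):
        action = plan_next(instruction, history)
        if action is None:  # planner signals that the task is complete
            break
        name, arg = action
        observation = tools[name](arg)  # execute the chosen tool
        history.append((name, arg, observation))
    return history


def scripted_planner(instruction, history):
    """Stand-in for the LLM: replay a fixed list of actions."""
    script = [("navigate", "https://example.com"), ("analyze_page", "")]
    return script[len(history)] if len(history) < len(script) else None


stub_tools = {
    "navigate": lambda url: f"opened {url}",
    "analyze_page": lambda _: "[1][link]More information",
}

history = run_agent("Open example.com and list its links", scripted_planner, stub_tools)
for step in history:
    print(step)
```

The `max_steps` cap is a design choice in this sketch: it bounds the loop if the planner never decides the task is finished.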
The system uses Playwright for browser automation, providing cross-browser compatibility and powerful DOM manipulation capabilities.
The Page Analyzer runs JavaScript in the page to scan the DOM and identify interactive elements, producing a numbered map that the agent uses to reference each element by ID.
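The numbering step can be illustrated without a browser. The sketch below assumes the DOM scan has already produced a list of dictionaries with "type", "text", and an optional "visible" flag; that input shape and the name `number_elements` are assumptions, not the analyzer's actual data model.

```python
def number_elements(elements):
    """Turn scanned elements into '[id][type]text' lines, skipping hidden ones."""
    lines = []
    next_id = 1
    for element in elements:
        if not element.get("visible", True):
            continue  # hidden elements get no ID
        text = " ".join(element.get("text", "").split())  # collapse whitespace
        lines.append(f"[{next_id}][{element['type']}]{text}")
        next_id += 1
    return lines


scanned = [
    {"type": "button", "text": "Sign Up"},
    {"type": "link", "text": "Hidden promo", "visible": False},
    {"type": "input", "text": "  Search "},
]
page_map = number_elements(scanned)
print("\n".join(page_map))
```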
The Browser Agent handles the following failure cases:
- Element Detection: Re-analysis after scrolling or waiting for dynamic content
- Click Failures: Alternative selection strategies and visibility checks
- Navigation Issues: Error detection and recovery for failed navigations
- Form Validation: Reading and addressing validation errors
- Session Management: Detection and handling of timeouts and authentication issues
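Most of the strategies listed above share one shape: try the action, run a recovery step on failure, then try again. The following is a minimal sketch of that pattern; `with_recovery` is a hypothetical helper, and the project's real recovery logic may be structured differently.

```python
def with_recovery(action, recover, attempts=3):
    """Run action(); after each failure call recover(exc) and retry."""
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                recover(exc)  # e.g. scroll, wait, or re-analyze the page
    raise RuntimeError(f"action failed after {attempts} attempts") from last_error


# Demo: a click that fails twice before the element becomes available.
calls = {"click": 0, "recover": 0}


def flaky_click():
    calls["click"] += 1
    if calls["click"] < 3:
        raise LookupError("element not found")
    return "clicked"


def reanalyze(_exc):
    calls["recover"] += 1


result = with_recovery(flaky_click, reanalyze)
print(result, calls)
```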
You can connect the Browser Agent to an already running Chrome instance:
- Launch Chrome with remote debugging enabled:
    python main.py launch --port 9222
- Configure the Browser Agent to use the existing instance:
    # In configurations/config.py
    BROWSER_CONNECTION = {
        "use_existing": True,
        "cdp_endpoint": "http://localhost:9222",
    }
- Run the Browser Agent:
    python main.py run
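The three BROWSER_CONNECTION keys suggest a connect-then-fall-back decision. The sketch below captures that decision with injected callables so it runs without a browser; `open_browser` is a hypothetical name. In a Playwright-based setup, `connect` might wrap `chromium.connect_over_cdp(endpoint)` and `launch` might wrap `chromium.launch()`, but that wiring is an assumption.

```python
def open_browser(connection, connect, launch):
    """Attach to an existing Chrome if configured, else launch a new browser."""
    if connection.get("use_existing") and connection.get("cdp_endpoint"):
        try:
            return connect(connection["cdp_endpoint"])
        except Exception:
            # Re-raise unless the config allows falling back to a fresh browser.
            if not connection.get("fallback_to_new", True):
                raise
    return launch()


def failing_connect(endpoint):
    raise ConnectionError(f"cannot reach {endpoint}")


config = {
    "use_existing": True,
    "cdp_endpoint": "http://localhost:9222",
    "fallback_to_new": True,
}
browser = open_browser(config, failing_connect, lambda: "new browser")
print(browser)
```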
You can chain multiple instructions into a single workflow:
python main.py run --task "Go to gmail.com, wait for the login page,
enter the username 'test@example.com', click next, wait for the password field,
ask me for the password, enter it, and click sign in"
The agent can prompt for user input during execution:
# Example of how the agent uses the ask_user tool
ask_user('{"prompt":"Please enter your 2FA code","type":"text"}')
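One detail worth handling in such a tool is that password prompts should not echo input. The sketch below is an illustrative stand-in, not the project's ask_user implementation; `handle_ask_user` and its injectable reader parameters are hypothetical.

```python
import json
from getpass import getpass


def handle_ask_user(raw, read_text=input, read_secret=getpass):
    """Prompt the user; use a non-echoing reader when type is 'password'."""
    request = json.loads(raw)
    prompt = request.get("prompt", "Input required") + ": "
    reader = read_secret if request.get("type") == "password" else read_text
    return reader(prompt)


# Demo with canned readers so the sketch runs without a terminal.
code = handle_ask_user(
    '{"prompt":"Please enter your 2FA code","type":"text"}',
    read_text=lambda prompt: "123456",
)
print(code)
```

Called without the reader arguments, the function falls back to `input` and `getpass.getpass` from the standard library.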
Browser Connection Issues
Problem: Unable to connect to Chrome with debugging enabled
Solutions:
- Ensure Chrome is not already running with the same debugging port
- Try a different port: python main.py launch --port 9223
- Check firewall settings that might block the connection
- Verify Chrome is installed in the default location or set the path manually
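Before launching Chrome, a quick check can tell whether the debugging port is already taken. This is a minimal sketch using only the standard library; `port_is_free` is a hypothetical helper and is not part of the project's CLI.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # address already in use, or not permitted
    return True


if not port_is_free(9222):
    print("Port 9222 is in use; try: python main.py launch --port 9223")
```

The result is only a snapshot: another process could claim the port between the check and the launch.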
Page Analysis Problems
Problem: Agent can't find or interact with elements on the page
Solutions:
- Give the page more time to load completely before analyzing
- For dynamic content, ask the agent to scroll and re-analyze
- Make your element references more specific
- For highly dynamic sites, try slowing down agent actions with the slowmo option
Authentication Challenges
Problem: Agent can't handle login procedures with CAPTCHA or 2FA
Solutions:
- Pre-authenticate in the browser before connecting the agent
- Use the ask_user tool to get manual input for verification challenges
- Consider using cookies or saved sessions for sites you access frequently
- Web Automation: Automate repetitive web tasks with natural language instructions
    Fill out the same form on 20 different websites with my business information
- Site Testing: Test web applications with natural language test cases
    Visit our app, try to create a new account, and report any errors
- Data Collection: Extract information from websites
    Go to the weather website and collect the 5-day forecast for New York, Chicago, and Los Angeles
- Interactive Assistance: Guide users through complex web processes
    Help me book a flight from New York to San Francisco for next Friday, returning Sunday
- Prototyping: Quickly test web workflows without writing code
    Try our new checkout process with different payment methods and report the experience
Limitations
- Cannot handle CAPTCHA or authentication challenges requiring human verification
- May face challenges with highly dynamic web apps that change rapidly
- Not designed for high-security operations (banking, etc.)
- Performance depends on the complexity of the website and instructions
- Cannot interact with elements that respond only to hover actions and have no clickable alternative
- May struggle with websites that heavily use canvas or WebGL for rendering
Roadmap
- Multi-session Support: Run multiple browser sessions simultaneously
- Screenshot Capabilities: Capture and analyze visual elements
- PDF Processing: Extract information from PDFs displayed in browsers
- Enhanced Error Recovery: More sophisticated recovery strategies
- User Interface: Add a web-based UI for easier interaction
- Action Recording: Record and replay browser sessions
- Customizable Agents: User-defined agent personalities and capabilities
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ by rkvalandasu