
Browser Scraper 🚀

A multi-agent browser automation framework built on page-agent. It drives pages through direct DOM manipulation for fast, reliable web interaction.

✨ Features

  • 🤖 Multi-Agent Workflow: Planner → Browser Agent → Extractor pipeline
  • 📄 DOM-Based: Direct DOM manipulation (no screenshots needed)
  • 🎯 Flexible: Works in the user's real browser or headless
  • ⚡ Fast: Direct DOM operations, no image processing
  • 🔄 Caching: Built-in result caching for performance
  • 🔌 MCP Server: External control via the Model Context Protocol
  • 🎨 Full Toolset: All page-agent tools enabled (JavaScript execution, user interaction, etc.)

πŸ—οΈ Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   PLANNER    │────▶│   SCRAPER    │────▶│  EXTRACTOR   │
│  (qwen3.5)   │     │  (glm-5.1)   │     │  (gemini)    │
└──────────────┘     └──────────────┘     └──────────────┘
                         │
                         ▼
                  ┌──────────────┐
                  │  PAGE-AGENT  │
                  │  (DOM-based) │
                  └──────────────┘
                         │
                         ▼
                  ┌──────────────┐
                  │   BROWSER    │
                  │  (User's)    │
                  └──────────────┘

📦 Installation

# Clone repository
git clone https://github.com/mamidevs/browser-scraper.git
cd browser-scraper

# Install dependencies
npm install

# Setup environment variables
cp .env.example .env
# Edit .env and add your OpenRouter API key

🔧 Configuration

Create a .env file:

# OpenRouter API Key
OPENROUTER_API_KEY=your_api_key_here

# Model Configuration (optional - defaults shown)
PLANNER_MODEL=qwen/qwen3.5-27b
SCRAPER_MODEL=z-ai/glm-5.1
EXTRACTOR_MODEL=google/gemini-2.5-flash

# Base URL (optional)
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
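The optional variables fall back to the defaults shown above. As an illustration of how that resolution could work (a sketch, not the library's actual code — `resolveConfig` is a hypothetical helper):

```typescript
// Hypothetical helper: resolve model configuration from environment
// variables, falling back to the documented defaults.
interface ModelConfig {
  plannerModel: string;
  scraperModel: string;
  extractorModel: string;
  baseUrl: string;
}

function resolveConfig(env: Record<string, string | undefined>): ModelConfig {
  return {
    plannerModel: env.PLANNER_MODEL ?? 'qwen/qwen3.5-27b',
    scraperModel: env.SCRAPER_MODEL ?? 'z-ai/glm-5.1',
    extractorModel: env.EXTRACTOR_MODEL ?? 'google/gemini-2.5-flash',
    baseUrl: env.OPENROUTER_BASE_URL ?? 'https://openrouter.ai/api/v1',
  };
}
```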

🚀 Usage

Basic Scraping (Single Agent)

import { ScrapingAgent } from 'browser-scraper';

const agent = new ScrapingAgent({
    model: 'z-ai/glm-5.1',
    apiKey: process.env.OPENROUTER_API_KEY,
});

await agent.initialize();

// Navigate and extract
await agent.navigate('https://example.com');
const result = await agent.execute('Extract all product names and prices');

console.log(result.data);

Multi-Agent Workflow

import { ScrapingOrchestrator } from 'browser-scraper';

const orchestrator = new ScrapingOrchestrator({
    planner: { model: 'qwen/qwen3.5-27b', apiKey: '...' },
    scraper: { model: 'z-ai/glm-5.1', apiKey: '...' },
    extractor: { model: 'google/gemini-2.5-flash', apiKey: '...' },
});

await orchestrator.initialize();

const result = await orchestrator.scrape({
    task: 'Find all laptops under $1000 with ratings above 4.0',
    url: 'https://amazon.com',
    schema: {
        type: 'array',
        items: {
            type: 'object',
            properties: {
                name: { type: 'string' },
                price: { type: 'number' },
                rating: { type: 'number' }
            }
        }
    }
});

console.log('Plan:', result.plan);
console.log('Data:', result.data);

Batch Scraping

const urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
];

const results = await orchestrator.scrapeBatch(
    urls.map(url => ({
        task: 'Extract product names and prices',
        url,
        schema: { /* ... */ }
    })),
    3 // concurrency
);
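The second argument caps how many tasks run at once. A minimal sketch of what such a concurrency-limited batch runner might look like (this is illustrative; `runBatch` is not the library's implementation):

```typescript
// Run `worker` over `items` with at most `concurrency` tasks in flight,
// preserving input order in the results array.
async function runBatch<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency: number,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each "lane" pulls the next unclaimed index until the queue is empty.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  const lanes = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: lanes }, lane));
  return results;
}
```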

MCP Server

import { ScrapingMCPServer } from 'browser-scraper';

const server = new ScrapingMCPServer();

await server.initialize({
    planner: { model: 'qwen/qwen3.5-27b', apiKey: '...' },
    scraper: { model: 'z-ai/glm-5.1', apiKey: '...' },
    extractor: { model: 'google/gemini-2.5-flash', apiKey: '...' },
});

// Available tools:
// - scrape-url: Single URL scraping
// - scrape-batch: Multiple URLs in parallel
// - extract-schema: Schema-based extraction
// - clear-cache: Clear all caches
// - get-stats: Get statistics

// Call from external MCP client
const result = await server.callTool({
    name: 'scrape-url',
    arguments: {
        url: 'https://example.com',
        task: 'Extract all links',
    }
});

📚 Examples

Run the examples:

# Basic scraping
npm run example:basic

# Multi-agent workflow
npm run example:multi

# MCP client
npm run example:mcp

🎯 Use Cases

1. E-commerce Price Monitoring

const result = await orchestrator.scrape({
    task: 'Extract product name, price, availability, and ratings',
    url: 'https://amazon.com/dp/B08N5KWB9H',
    schema: {
        type: 'object',
        properties: {
            name: { type: 'string' },
            price: { type: 'number' },
            currency: { type: 'string' },
            availability: { type: 'string' },
            rating: { type: 'number' },
            reviews: { type: 'integer' }
        }
    }
});

2. Job Board Aggregation

const result = await orchestrator.scrape({
    task: `
        1. Search for "software engineer" jobs
        2. Filter by location "Remote"
        3. Extract job title, company, salary, and application URL
    `,
    url: 'https://linkedin.com/jobs',
    schema: {
        type: 'array',
        items: {
            type: 'object',
            properties: {
                title: { type: 'string' },
                company: { type: 'string' },
                salary: { type: 'string' },
                location: { type: 'string' },
                applyUrl: { type: 'string' }
            }
        }
    }
});

3. News Article Collection

const urls = [
    'https://techcrunch.com',
    'https://theverge.com',
    'https://arstechnica.com',
];

const results = await orchestrator.scrapeBatch(
    urls.map(url => ({
        task: 'Extract article title, author, publish date, and summary',
        url,
        schema: {
            type: 'array',
            items: {
                type: 'object',
                properties: {
                    title: { type: 'string' },
                    author: { type: 'string' },
                    date: { type: 'string' },
                    summary: { type: 'string' },
                    url: { type: 'string' }
                }
            }
        }
    })),
    3
);

🔧 API Reference

ScrapingAgent

Single-agent wrapper around page-agent.

Methods:

  • initialize(): Initialize the agent (must be called first)
  • navigate(url): Navigate to a URL
  • execute(task): Execute a natural language task
  • extractData(schema): Extract structured data using schema
  • clearCache(): Clear cached results

ScrapingOrchestrator

Multi-agent orchestration.

Methods:

  • initialize(): Initialize all agents
  • scrape(taskConfig): Execute scraping with multi-agent workflow
  • scrapeBatch(tasks, concurrency): Execute multiple scraping tasks in parallel
  • clearCache(): Clear all caches
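The caching behind `clearCache()` can be pictured as a store keyed by the (url, task) pair. A simplified sketch of the idea (not the library's actual `storage/cache.ts` implementation):

```typescript
// Illustrative result cache keyed by (url, task); repeated identical
// scraping requests can be answered without re-running the agents.
class ResultCache<V> {
  private store = new Map<string, V>();

  private key(url: string, task: string): string {
    return `${url}::${task}`;
  }

  get(url: string, task: string): V | undefined {
    return this.store.get(this.key(url, task));
  }

  set(url: string, task: string, value: V): void {
    this.store.set(this.key(url, task), value);
  }

  clear(): void {
    this.store.clear();
  }
}
```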

ScrapingMCPServer

MCP server for external control.

Tools:

  • scrape-url: Single URL scraping
  • scrape-batch: Batch scraping
  • extract-schema: Schema-based extraction
  • clear-cache: Clear cache
  • get-stats: Get statistics

πŸƒ Running

# Build
npm run build

# Run examples
npm run example:basic
npm run example:multi
npm run example:mcp

# Test
npm test

πŸ“ Project Structure

browser-scraper/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ agents/
β”‚   β”‚   └── orchestrator.ts    # Multi-agent coordinator
β”‚   β”œβ”€β”€ page-agent/
β”‚   β”‚   └── wrapper.ts         # Page-agent integration
β”‚   β”œβ”€β”€ storage/
β”‚   β”‚   └── cache.ts           # Caching layer
β”‚   β”œβ”€β”€ mcp/
β”‚   β”‚   └── server.ts          # MCP server
β”‚   └── index.ts               # Main entry
β”œβ”€β”€ examples/
β”‚   β”œβ”€β”€ basic.ts               # Basic example
β”‚   β”œβ”€β”€ multi-agent.ts         # Multi-agent workflow
β”‚   └── mcp-client.ts          # MCP client
β”œβ”€β”€ tests/
β”œβ”€β”€ package.json
β”œβ”€β”€ tsconfig.json
└── README.md

🔒 Privacy & Ethics

  • Respect robots.txt: Always check the website's robots.txt
  • Rate limiting: Use caching and respect rate limits
  • Terms of Service: Ensure scraping is allowed by the site's ToS
  • Data usage: Use scraped data responsibly and legally
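As a starting point for the robots.txt check, here is a deliberately minimal sketch that handles only `Disallow` rules in the wildcard (`User-agent: *`) group; a real parser should also handle `Allow` rules, specific user-agents, and wildcard patterns:

```typescript
// Return false if `path` matches a Disallow rule in the `User-agent: *`
// group of the given robots.txt text; true otherwise. Simplified sketch.
function isPathAllowed(robotsTxt: string, path: string): boolean {
  let inStarGroup = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments and whitespace
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (key.toLowerCase() === 'user-agent') {
      inStarGroup = value === '*';
    } else if (inStarGroup && key.toLowerCase() === 'disallow' && value) {
      if (path.startsWith(value)) return false;
    }
  }
  return true;
}
```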

🤝 Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

📄 License

MIT License - see LICENSE file for details.

🔗 Related Projects

📞 Support


Built with ❤️ using page-agent and Hermes Agent
