diff --git a/.claude/commands/gh-pr.md b/.claude/commands/gh-pr.md deleted file mode 100644 index 8d95e37..0000000 --- a/.claude/commands/gh-pr.md +++ /dev/null @@ -1,2 +0,0 @@ -Create a PR on github with an accurate description following our naming convention for the current changes. $ARGUMENTS - diff --git a/.claude/skills/gh-pr.md b/.claude/skills/gh-pr.md deleted file mode 100644 index ac4aa25..0000000 --- a/.claude/skills/gh-pr.md +++ /dev/null @@ -1,187 +0,0 @@ ---- -name: creating-pr -description: Use when creating or updating pull requests with comprehensive descriptions and meaningful commits - streamlines PR workflow with branch management and commit best practices ---- - -You are an expert Git and GitHub workflow automation specialist with deep knowledge of version control best practices and pull request management. Your primary responsibility is streamlining the pull request creation process, ensuring high-quality commits with meaningful descriptions. - -## Common Operations - -### GitHub CLI Commands Reference - -```bash -# PR Management -gh pr view # View current branch PR -gh pr list # List open PRs -gh pr view --json number -q .number # Get PR number -gh pr create --title "" --body "" # Create new PR -gh pr edit --body "" # Update description -gh pr edit --add-label "" # Add labels - -# Git Commands -git branch --show-current # Current branch -git status # Check changes -git diff # View unstaged changes -git diff --cached # View staged changes -git diff HEAD~1..HEAD # Last commit diff -git rev-parse HEAD # Get commit SHA -git log -1 --pretty=%s # Last commit message -``` - -## Workflow - -### Creating/Updating Pull Requests - -1. **Branch Management**: - - - Check current branch: `git branch --show-current` - - If on main/master/next, create feature branch with conventional naming - - Branch convention: `//` (e.g., `fzuppichini/features/new-feature`) - - Switch to new branch: `git checkout -b //` - -2. 
**Analyze & Stage**: - - - Review changes: `git status` and `git diff` - - Identify change type (feature, fix, refactor, docs, test, chore) - - Stage ALL changes: `git add .` (preferred due to slow Husky hooks) - - Verify: `git diff --cached` - -3. **Commit & Push**: - - - **Single Commit Strategy**: Use one comprehensive commit per push due to slow Husky hooks - - Format: `type: brief description` (simple format preferred) - - Commit: `git commit -m "type: description"` with average git comment - - Push: `git push -u origin branch-name` - -4. **PR Management**: - - - Check existing: `gh pr view` - - If exists: push updates, **add update comment** (preserve original description) - - If not: `gh pr create` with title and description - -## Update Comment Templates - -When updating existing PRs, use these comment templates to preserve the original description: - -### General PR Update Template - -```markdown -## 🔄 PR Update - -**Commit**: `` - `` - -### Changes Made - -- [List specific changes in this update] -- [Highlight any breaking changes] -- [Note new features or fixes] - -### Impact - -- [Areas of code affected] -- [Performance/behavior changes] -- [Dependencies updated] - -### Testing - -- [How to test these changes] -- [Regression testing notes] - -### Next Steps - -- [Remaining work if any] -- [Items for review focus] - -🤖 Generated with [Claude Code](https://claude.ai/code) -``` - -### Critical Fix Update Template - -```markdown -## 🚨 Critical Fix Applied - -**Commit**: `` - `` - -### Issue Addressed - -[Description of critical issue fixed] - -### Solution - -[Technical approach taken] - -### Verification Steps - -1. [Step to reproduce original issue] -2. [Step to verify fix] -3. 
[Regression test steps] - -### Risk Assessment - -- **Impact**: [Low/Medium/High] -- **Scope**: [Files/features affected] -- **Backwards Compatible**: [Yes/No - details if no] - -🤖 Generated with [Claude Code](https://claude.ai/code) -``` - -### Feature Enhancement Template - -```markdown -## ✨ Feature Enhancement - -**Commit**: `` - `` - -### Enhancement Details - -[Description of feature improvement/addition] - -### Technical Implementation - -- [Key architectural decisions] -- [New dependencies or patterns] -- [Performance considerations] - -### User Experience Impact - -[How this affects end users] - -### Testing Strategy - -[Approach to testing this enhancement] - -🤖 Generated with [Claude Code](https://claude.ai/code) -``` - -## Example Usage Patterns - -### Creating PR: - -1. Create branch and make changes -2. Stage, commit, push → triggers PR creation -3. Each subsequent push triggers update comment -4. By default assume the PR is *wip* (work in progress) so open it appropriately - -### Commit Message Conventions - -See **[docs/GIT_STYLE.md](docs/GIT_STYLE.md)** for full guide. 
- -- `feat:` - New features -- `fix:` - Bug fixes -- `refactor:` - Code refactoring -- `docs:` - Documentation changes -- `test:` - Test additions/modifications -- `chore:` - Maintenance tasks -- `style:` - Formatting changes -- `content:` - Content changes (blog, copy) -- `perf:` - Performance improvements - -### Branch Naming Conventions - -Always use `//` format: - -- `username/features/description` - New features -- `username/fix/description` - Bug fixes -- `username/refactor/description` - Code refactoring -- `username/docs/description` - Documentation updates -- `username/test/description` - Test additions \ No newline at end of file diff --git a/.gitignore b/.gitignore index f483273..8a3b82e 100644 --- a/.gitignore +++ b/.gitignore @@ -5,4 +5,5 @@ bun.lock *.tsbuildinfo .env doc/ -.DS_Store +.claude/ +CLAUDE.md diff --git a/README.md b/README.md index 1af72b7..bd8b669 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,13 @@ -# ScrapeGraph JS SDK +# ScrapeGraphAI JS SDK [![npm version](https://badge.fury.io/js/scrapegraph-js.svg)](https://badge.fury.io/js/scrapegraph-js) [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) -

- ScrapeGraph API Banner
+ ScrapeGraphAI JS SDK

-Official TypeScript SDK for the [ScrapeGraph AI API](https://scrapegraphai.com). Zero dependencies. +Official TypeScript SDK for the [ScrapeGraphAI API](https://scrapegraphai.com). ## Install @@ -20,15 +20,18 @@ bun add scrapegraph-js ## Quick Start ```ts -import { smartScraper } from "scrapegraph-js"; +import { ScrapeGraphAI } from "scrapegraph-js"; -const result = await smartScraper("your-api-key", { - user_prompt: "Extract the page title and description", - website_url: "https://example.com", +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." }) +const sgai = ScrapeGraphAI(); + +const result = await sgai.scrape({ + url: "https://example.com", + formats: [{ type: "markdown" }], }); if (result.status === "success") { - console.log(result.data); + console.log(result.data?.results.markdown?.data); } else { console.error(result.error); } ``` @@ -47,187 +50,175 @@ type ApiResult = { ## API -All functions take `(apiKey, params)` where `params` is a typed object. - -### smartScraper +### scrape -Extract structured data from a webpage using AI. +Scrape a webpage in multiple formats (markdown, html, screenshot, json, etc.). 
```ts -const res = await smartScraper("key", { - user_prompt: "Extract product names and prices", - website_url: "https://example.com", - output_schema: { /* JSON schema */ }, // optional - number_of_scrolls: 5, // optional, 0-50 - total_pages: 3, // optional, 1-100 - stealth: true, // optional, +4 credits - cookies: { session: "abc" }, // optional - headers: { "Accept-Language": "en" }, // optional - steps: ["Click 'Load More'"], // optional, browser actions - wait_ms: 5000, // optional, default 3000 - country_code: "us", // optional, proxy routing - mock: true, // optional, testing mode +const res = await sgai.scrape({ + url: "https://example.com", + formats: [ + { type: "markdown", mode: "reader" }, + { type: "screenshot", fullPage: true, width: 1440, height: 900 }, + { type: "json", prompt: "Extract product info" }, + ], + contentType: "text/html", // optional, auto-detected + fetchConfig: { // optional + mode: "js", // "auto" | "fast" | "js" + stealth: true, + timeout: 30000, + wait: 2000, + scrolls: 3, + headers: { "Accept-Language": "en" }, + cookies: { session: "abc" }, + country: "us", + }, }); ``` -### searchScraper - -Search the web and extract structured results. 
- -```ts -const res = await searchScraper("key", { - user_prompt: "Latest TypeScript release features", - num_results: 5, // optional, 3-20 - extraction_mode: true, // optional, false for markdown - output_schema: { /* */ }, // optional - stealth: true, // optional, +4 credits - time_range: "past_week", // optional, past_hour|past_24_hours|past_week|past_month|past_year - location_geo_code: "us", // optional, geographic targeting - mock: true, // optional, testing mode -}); -// res.data.result (extraction mode) or res.data.markdown_content (markdown mode) -``` +**Formats:** +- `markdown` — Clean markdown (modes: `normal`, `reader`, `prune`) +- `html` — Raw HTML (modes: `normal`, `reader`, `prune`) +- `links` — All links on the page +- `images` — All image URLs +- `summary` — AI-generated summary +- `json` — Structured extraction with prompt/schema +- `branding` — Brand colors, typography, logos +- `screenshot` — Page screenshot (fullPage, width, height, quality) -### markdownify +### extract -Convert a webpage to clean markdown. +Extract structured data from a URL, HTML, or markdown using AI. ```ts -const res = await markdownify("key", { - website_url: "https://example.com", - stealth: true, // optional, +4 credits - wait_ms: 5000, // optional, default 3000 - country_code: "us", // optional, proxy routing - mock: true, // optional, testing mode +const res = await sgai.extract({ + url: "https://example.com", + prompt: "Extract product names and prices", + schema: { /* JSON schema */ }, // optional + mode: "reader", // optional + fetchConfig: { /* ... */ }, // optional }); -// res.data.result is the markdown string +// Or pass html/markdown directly instead of url ``` -### scrape +### search -Get raw HTML from a webpage. +Search the web and optionally extract structured data. 
```ts -const res = await scrape("key", { - website_url: "https://example.com", - stealth: true, // optional, +4 credits - branding: true, // optional, extract brand design - country_code: "us", // optional, proxy routing - wait_ms: 5000, // optional, default 3000 +const res = await sgai.search({ + query: "best programming languages 2024", + numResults: 5, // 1-20, default 3 + format: "markdown", // "markdown" | "html" + prompt: "Extract key points", // optional, for AI extraction + schema: { /* ... */ }, // optional + timeRange: "past_week", // optional + locationGeoCode: "us", // optional + fetchConfig: { /* ... */ }, // optional }); -// res.data.html is the HTML string -// res.data.scrape_request_id is the request identifier ``` ### crawl -Crawl a website and its linked pages. Async — polls until completion. +Crawl a website and its linked pages. ```ts -const res = await crawl( - "key", - { - url: "https://example.com", - prompt: "Extract company info", // required when extraction_mode=true - max_pages: 10, // optional, default 10 - depth: 2, // optional, default 1 - breadth: 5, // optional, max links per depth - schema: { /* JSON schema */ }, // optional - sitemap: true, // optional - stealth: true, // optional, +4 credits - wait_ms: 5000, // optional, default 3000 - batch_size: 3, // optional, default 1 - same_domain_only: true, // optional, default true - cache_website: true, // optional - headers: { "Accept-Language": "en" }, // optional - }, - (status) => console.log(status), // optional poll callback -); -``` - -### agenticScraper +// Start a crawl +const start = await sgai.crawl.start({ + url: "https://example.com", + formats: [{ type: "markdown" }], + maxPages: 50, + maxDepth: 2, + maxLinksPerPage: 10, + includePatterns: ["/blog/*"], + excludePatterns: ["/admin/*"], + fetchConfig: { /* ... */ }, +}); -Automate browser actions (click, type, navigate) then extract data. 
+// Check status +const status = await sgai.crawl.get(start.data?.id!); -```ts -const res = await agenticScraper("key", { - url: "https://example.com/login", - steps: ["Type user@example.com in email", "Click login button"], // required - user_prompt: "Extract dashboard data", // required when ai_extraction=true - output_schema: { /* */ }, // required when ai_extraction=true - ai_extraction: true, // optional - use_session: true, // optional -}); +// Control +await sgai.crawl.stop(id); +await sgai.crawl.resume(id); +await sgai.crawl.delete(id); ``` -### generateSchema +### monitor -Generate a JSON schema from a natural language description. +Monitor a webpage for changes on a schedule. ```ts -const res = await generateSchema("key", { - user_prompt: "Schema for a product with name, price, and rating", - existing_schema: { /* modify this */ }, // optional +// Create a monitor +const mon = await sgai.monitor.create({ + url: "https://example.com", + name: "Price Monitor", + interval: "0 * * * *", // cron expression + formats: [{ type: "markdown" }], + webhookUrl: "https://...", // optional + fetchConfig: { /* ... */ }, }); + +// Manage monitors +await sgai.monitor.list(); +await sgai.monitor.get(cronId); +await sgai.monitor.update(cronId, { interval: "0 */6 * * *" }); +await sgai.monitor.pause(cronId); +await sgai.monitor.resume(cronId); +await sgai.monitor.delete(cronId); ``` -### sitemap +### history -Extract all URLs from a website's sitemap. +Fetch request history. 
```ts -const res = await sitemap("key", { - website_url: "https://example.com", - headers: { /* */ }, // optional - stealth: true, // optional, +4 credits - mock: true, // optional, testing mode +const list = await sgai.history.list({ + service: "scrape", // optional filter + page: 1, + limit: 20, }); -// res.data.urls is string[] -``` - -### getCredits / checkHealth -```ts -const credits = await getCredits("key"); -// { remaining_credits: 420, total_credits_used: 69 } - -const health = await checkHealth("key"); -// { status: "healthy" } +const entry = await sgai.history.get("request-id"); ``` -### history - -Fetch request history for any service. +### credits / healthy ```ts -const res = await history("key", { - service: "smartscraper", - page: 1, // optional, default 1 - page_size: 10, // optional, default 10 -}); +const credits = await sgai.credits(); +// { remaining: 1000, used: 500, plan: "pro", jobs: { crawl: {...}, monitor: {...} } } + +const health = await sgai.healthy(); +// { status: "ok", uptime: 12345 } ``` ## Examples -Find complete working examples in the [`examples/`](https://github.com/ScrapeGraphAI/scrapegraph-js/tree/main/examples) directory: - -| Service | Examples | -|---|---| -| [SmartScraper](https://github.com/ScrapeGraphAI/scrapegraph-js/tree/main/examples/smartscraper) | basic, cookies, html input, infinite scroll, markdown input, pagination, stealth, with schema | -| [SearchScraper](https://github.com/ScrapeGraphAI/scrapegraph-js/tree/main/examples/searchscraper) | basic, markdown mode, with schema | -| [Markdownify](https://github.com/ScrapeGraphAI/scrapegraph-js/tree/main/examples/markdownify) | basic, stealth | -| [Scrape](https://github.com/ScrapeGraphAI/scrapegraph-js/tree/main/examples/scrape) | basic, stealth, with branding | -| [Crawl](https://github.com/ScrapeGraphAI/scrapegraph-js/tree/main/examples/crawl) | basic, markdown mode, with schema | -| [Agentic 
Scraper](https://github.com/ScrapeGraphAI/scrapegraph-js/tree/main/examples/agenticscraper) | basic, AI extraction | -| [Schema Generation](https://github.com/ScrapeGraphAI/scrapegraph-js/tree/main/examples/schema) | basic, modify existing | -| [Sitemap](https://github.com/ScrapeGraphAI/scrapegraph-js/tree/main/examples/sitemap) | basic, with smartscraper | -| [Utilities](https://github.com/ScrapeGraphAI/scrapegraph-js/tree/main/examples/utilities) | credits, health, history | +| Service | Example | Description | +|---------|---------|-------------| +| scrape | [`scrape_basic.ts`](examples/scrape/scrape_basic.ts) | Basic markdown scraping | +| scrape | [`scrape_multi_format.ts`](examples/scrape/scrape_multi_format.ts) | Multiple formats (markdown, links, images, screenshot, summary) | +| scrape | [`scrape_json_extraction.ts`](examples/scrape/scrape_json_extraction.ts) | Structured JSON extraction with schema | +| scrape | [`scrape_pdf.ts`](examples/scrape/scrape_pdf.ts) | PDF document parsing with OCR metadata | +| scrape | [`scrape_with_fetchconfig.ts`](examples/scrape/scrape_with_fetchconfig.ts) | JS rendering, stealth mode, scrolling | +| extract | [`extract_basic.ts`](examples/extract/extract_basic.ts) | AI data extraction from URL | +| extract | [`extract_with_schema.ts`](examples/extract/extract_with_schema.ts) | Extraction with JSON schema | +| search | [`search_basic.ts`](examples/search/search_basic.ts) | Web search with results | +| search | [`search_with_extraction.ts`](examples/search/search_with_extraction.ts) | Search + AI extraction | +| crawl | [`crawl_basic.ts`](examples/crawl/crawl_basic.ts) | Start and monitor a crawl | +| crawl | [`crawl_with_formats.ts`](examples/crawl/crawl_with_formats.ts) | Crawl with screenshots and patterns | +| monitor | [`monitor_basic.ts`](examples/monitor/monitor_basic.ts) | Create a page monitor | +| monitor | [`monitor_with_webhook.ts`](examples/monitor/monitor_with_webhook.ts) | Monitor with webhook notifications | 
+| utilities | [`credits.ts`](examples/utilities/credits.ts) | Check account credits and limits | +| utilities | [`health.ts`](examples/utilities/health.ts) | API health check | +| utilities | [`history.ts`](examples/utilities/history.ts) | Request history | ## Environment Variables | Variable | Description | Default | -|---|---|---| -| `SGAI_API_URL` | Override API base URL | `https://api.scrapegraphai.com/v1` | +|----------|-------------|---------| +| `SGAI_API_KEY` | Your ScrapeGraphAI API key | — | +| `SGAI_API_URL` | Override API base URL | `https://api.scrapegraphai.com/v2` | | `SGAI_DEBUG` | Enable debug logging (`"1"`) | off | | `SGAI_TIMEOUT_S` | Request timeout in seconds | `120` | @@ -235,11 +226,12 @@ Find complete working examples in the [`examples/`](https://github.com/ScrapeGra ```bash bun install -bun test # 21 tests -bun run build # tsup → dist/ -bun run check # tsc --noEmit + biome +bun run test # unit tests +bun run test:integration # live API tests (requires SGAI_API_KEY) +bun run build # tsup → dist/ +bun run check # tsc --noEmit + biome ``` ## License -MIT - [ScrapeGraph AI](https://scrapegraphai.com) +MIT - [ScrapeGraphAI AI](https://scrapegraphai.com) diff --git a/examples/agenticscraper/agenticscraper_ai_extraction.ts b/examples/agenticscraper/agenticscraper_ai_extraction.ts deleted file mode 100644 index db90aa5..0000000 --- a/examples/agenticscraper/agenticscraper_ai_extraction.ts +++ /dev/null @@ -1,35 +0,0 @@ -import { agenticScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const schema = { - type: "object", - properties: { - username: { type: "string" }, - email: { type: "string" }, - available_sections: { type: "array", items: { type: "string" } }, - credits_remaining: { type: "number" }, - }, - required: ["username", "available_sections"], -}; - -const res = await agenticScraper(apiKey, { - url: "https://dashboard.scrapegraphai.com/", - steps: [ - "Type email@gmail.com in email input box", - "Type 
test-password@123 in password input box", - "Click on login", - "Wait for dashboard to load completely", - ], - use_session: true, - ai_extraction: true, - user_prompt: - "Extract the user's dashboard info: username, email, available sections, and remaining credits", - output_schema: schema, -}); - -if (res.status === "success") { - console.log("Dashboard Info:", JSON.stringify(res.data?.result, null, 2)); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/agenticscraper/agenticscraper_basic.ts b/examples/agenticscraper/agenticscraper_basic.ts deleted file mode 100644 index 04f6ea9..0000000 --- a/examples/agenticscraper/agenticscraper_basic.ts +++ /dev/null @@ -1,22 +0,0 @@ -import { agenticScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await agenticScraper(apiKey, { - url: "https://dashboard.scrapegraphai.com/", - steps: [ - "Type email@gmail.com in email input box", - "Type test-password@123 in password input box", - "Click on login", - ], - use_session: true, - ai_extraction: false, -}); - -if (res.status === "success") { - console.log("Request ID:", res.data?.request_id); - console.log("Status:", res.data?.status); - console.log("Result:", JSON.stringify(res.data?.result, null, 2)); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/crawl/crawl_basic.ts b/examples/crawl/crawl_basic.ts index 5cd34f2..f0aeb57 100644 --- a/examples/crawl/crawl_basic.ts +++ b/examples/crawl/crawl_basic.ts @@ -1,23 +1,21 @@ -import { crawl } from "scrapegraph-js"; +import { ScrapeGraphAI } from "scrapegraph-js"; -const apiKey = process.env.SGAI_API_KEY!; +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." 
}) +const sgai = ScrapeGraphAI(); -const res = await crawl( - apiKey, - { - url: "https://scrapegraphai.com", - prompt: "Extract the main content from each page", - max_pages: 5, - depth: 2, - sitemap: true, - }, - (status) => console.log(`Poll: ${status}`), -); +const startRes = await sgai.crawl.start({ + url: "https://example.com", + maxPages: 5, + maxDepth: 2, +}); -if (res.status === "success") { - console.log("Pages crawled:", res.data?.crawled_urls?.length); - console.log("Result:", JSON.stringify(res.data?.llm_result, null, 2)); - console.log(`Took ${res.elapsedMs}ms`); +if (startRes.status !== "success" || !startRes.data) { + console.error("Failed to start:", startRes.error); } else { - console.error("Failed:", res.error); + console.log("Crawl started:", startRes.data.id); + console.log("Status:", startRes.data.status); + + const getRes = await sgai.crawl.get(startRes.data.id); + console.log("\nProgress:", getRes.data?.finished, "/", getRes.data?.total); + console.log("Pages:", getRes.data?.pages.map((p) => p.url)); } diff --git a/examples/crawl/crawl_markdown.ts b/examples/crawl/crawl_markdown.ts deleted file mode 100644 index e0021ef..0000000 --- a/examples/crawl/crawl_markdown.ts +++ /dev/null @@ -1,28 +0,0 @@ -import { crawl } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -// extraction_mode: false returns raw markdown for each page -const res = await crawl( - apiKey, - { - url: "https://scrapegraphai.com", - extraction_mode: false, - max_pages: 5, - depth: 2, - sitemap: true, - }, - (status) => console.log(`Poll: ${status}`), -); - -if (res.status === "success") { - console.log(`Crawled ${res.data?.pages?.length ?? 0} pages\n`); - for (const page of res.data?.pages ?? 
[]) { - console.log(`--- ${page.url} ---`); - console.log(page.markdown.slice(0, 500)); - console.log("...\n"); - } - console.log(`Took ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/crawl/crawl_with_formats.ts b/examples/crawl/crawl_with_formats.ts new file mode 100644 index 0000000..aab74c1 --- /dev/null +++ b/examples/crawl/crawl_with_formats.ts @@ -0,0 +1,24 @@ +import { ScrapeGraphAI } from "scrapegraph-js"; + +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." }) +const sgai = ScrapeGraphAI(); + +const res = await sgai.crawl.start({ + url: "https://example.com", + formats: [ + { type: "markdown", mode: "reader" }, + { type: "screenshot", width: 1280, height: 720 }, + ], + maxPages: 10, + maxDepth: 2, + includePatterns: ["/blog/*", "/docs/*"], + excludePatterns: ["/admin/*"], +}); + +if (res.status === "success") { + console.log("Crawl ID:", res.data?.id); + console.log("Status:", res.data?.status); + console.log("Total pages to crawl:", res.data?.total); +} else { + console.error("Failed:", res.error); +} diff --git a/examples/crawl/crawl_with_schema.ts b/examples/crawl/crawl_with_schema.ts deleted file mode 100644 index f236b2a..0000000 --- a/examples/crawl/crawl_with_schema.ts +++ /dev/null @@ -1,50 +0,0 @@ -import { crawl } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const schema = { - type: "object", - properties: { - company: { - type: "object", - properties: { - name: { type: "string" }, - description: { type: "string" }, - features: { type: "array", items: { type: "string" } }, - }, - required: ["name", "description"], - }, - services: { - type: "array", - items: { - type: "object", - properties: { - service_name: { type: "string" }, - description: { type: "string" }, - }, - required: ["service_name", "description"], - }, - }, - }, - required: ["company", "services"], -}; - -const res = await crawl( - apiKey, - { - url: 
"https://scrapegraphai.com", - prompt: "Extract company info, services, and features", - schema, - max_pages: 3, - depth: 2, - sitemap: true, - }, - (status) => console.log(`Poll: ${status}`), -); - -if (res.status === "success") { - console.log("Result:", JSON.stringify(res.data?.llm_result, null, 2)); - console.log(`Took ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/extract/extract_basic.ts b/examples/extract/extract_basic.ts new file mode 100644 index 0000000..73992ef --- /dev/null +++ b/examples/extract/extract_basic.ts @@ -0,0 +1,16 @@ +import { ScrapeGraphAI } from "scrapegraph-js"; + +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." }) +const sgai = ScrapeGraphAI(); + +const res = await sgai.extract({ + url: "https://example.com", + prompt: "What is this page about? Extract the main heading and description.", +}); + +if (res.status === "success") { + console.log("Extracted:", JSON.stringify(res.data?.json, null, 2)); + console.log("\nTokens used:", res.data?.usage); +} else { + console.error("Failed:", res.error); +} diff --git a/examples/extract/extract_with_schema.ts b/examples/extract/extract_with_schema.ts new file mode 100644 index 0000000..c09611e --- /dev/null +++ b/examples/extract/extract_with_schema.ts @@ -0,0 +1,23 @@ +import { ScrapeGraphAI } from "scrapegraph-js"; + +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." 
}) +const sgai = ScrapeGraphAI(); + +const res = await sgai.extract({ + url: "https://example.com", + prompt: "Extract the page title and description", + schema: { + type: "object", + properties: { + title: { type: "string" }, + description: { type: "string" }, + }, + required: ["title"], + }, +}); + +if (res.status === "success") { + console.log("Extracted:", JSON.stringify(res.data?.json, null, 2)); +} else { + console.error("Failed:", res.error); +} diff --git a/examples/markdownify/markdownify_basic.ts b/examples/markdownify/markdownify_basic.ts deleted file mode 100644 index b8bda56..0000000 --- a/examples/markdownify/markdownify_basic.ts +++ /dev/null @@ -1,13 +0,0 @@ -import { markdownify } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await markdownify(apiKey, { - website_url: "https://scrapegraphai.com", -}); - -if (res.status === "success") { - console.log(res.data?.result); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/markdownify/markdownify_stealth.ts b/examples/markdownify/markdownify_stealth.ts deleted file mode 100644 index 056d54d..0000000 --- a/examples/markdownify/markdownify_stealth.ts +++ /dev/null @@ -1,17 +0,0 @@ -import { markdownify } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await markdownify(apiKey, { - website_url: "https://example.com", - stealth: true, - headers: { - "Accept-Language": "en-US,en;q=0.9", - }, -}); - -if (res.status === "success") { - console.log(res.data?.result); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/monitor/monitor_basic.ts b/examples/monitor/monitor_basic.ts new file mode 100644 index 0000000..cdb227c --- /dev/null +++ b/examples/monitor/monitor_basic.ts @@ -0,0 +1,19 @@ +import { ScrapeGraphAI } from "scrapegraph-js"; + +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." 
}) +const sgai = ScrapeGraphAI(); + +const res = await sgai.monitor.create({ + url: "https://example.com", + name: "Example Monitor", + interval: "0 * * * *", + formats: [{ type: "markdown" }], +}); + +if (res.status === "success") { + console.log("Monitor created:", res.data?.cronId); + console.log("Status:", res.data?.status); + console.log("Interval:", res.data?.interval); +} else { + console.error("Failed:", res.error); +} diff --git a/examples/monitor/monitor_with_webhook.ts b/examples/monitor/monitor_with_webhook.ts new file mode 100644 index 0000000..ddbaa77 --- /dev/null +++ b/examples/monitor/monitor_with_webhook.ts @@ -0,0 +1,22 @@ +import { ScrapeGraphAI } from "scrapegraph-js"; + +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." }) +const sgai = ScrapeGraphAI(); + +const res = await sgai.monitor.create({ + url: "https://example.com/prices", + name: "Price Monitor", + interval: "0 */6 * * *", + formats: [ + { type: "markdown" }, + { type: "json", prompt: "Extract all product prices" }, + ], + webhookUrl: "https://your-server.com/webhook", +}); + +if (res.status === "success") { + console.log("Monitor created:", res.data?.cronId); + console.log("Will notify:", res.data?.config.webhookUrl); +} else { + console.error("Failed:", res.error); +} diff --git a/examples/schema/generate_schema_basic.ts b/examples/schema/generate_schema_basic.ts deleted file mode 100644 index 4efca04..0000000 --- a/examples/schema/generate_schema_basic.ts +++ /dev/null @@ -1,16 +0,0 @@ -import { generateSchema } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await generateSchema(apiKey, { - user_prompt: - "Find laptops with specifications like brand, processor, RAM, storage, and price", -}); - -if (res.status === "success") { - console.log("Refined prompt:", res.data?.refined_prompt); - console.log("\nGenerated schema:"); - console.log(JSON.stringify(res.data?.generated_schema, null, 2)); -} else { - 
console.error("Failed:", res.error); -} diff --git a/examples/schema/modify_existing_schema.ts b/examples/schema/modify_existing_schema.ts deleted file mode 100644 index d75e4a7..0000000 --- a/examples/schema/modify_existing_schema.ts +++ /dev/null @@ -1,34 +0,0 @@ -import { generateSchema } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const existingSchema = { - title: "ProductList", - type: "object", - properties: { - products: { - type: "array", - items: { - type: "object", - properties: { - name: { type: "string" }, - price: { type: "number" }, - }, - required: ["name", "price"], - }, - }, - }, - required: ["products"], -}; - -const res = await generateSchema(apiKey, { - user_prompt: "Add brand, category, and rating fields to the existing product schema", - existing_schema: existingSchema, -}); - -if (res.status === "success") { - console.log("Modified schema:"); - console.log(JSON.stringify(res.data?.generated_schema, null, 2)); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/scrape/scrape_basic.ts b/examples/scrape/scrape_basic.ts index 7531f95..0d34e05 100644 --- a/examples/scrape/scrape_basic.ts +++ b/examples/scrape/scrape_basic.ts @@ -1,14 +1,15 @@ -import { scrape } from "scrapegraph-js"; +import { ScrapeGraphAI } from "scrapegraph-js"; -const apiKey = process.env.SGAI_API_KEY!; +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." 
}) +const sgai = ScrapeGraphAI(); -const res = await scrape(apiKey, { - website_url: "https://example.com", +const res = await sgai.scrape({ + url: "https://example.com", + formats: [{ type: "markdown" }], }); if (res.status === "success") { - console.log(`HTML length: ${res.data?.html.length} chars`); - console.log("Preview:", res.data?.html.slice(0, 500)); + console.log("Markdown:", res.data?.results.markdown?.data); console.log(`\nTook ${res.elapsedMs}ms`); } else { console.error("Failed:", res.error); diff --git a/examples/scrape/scrape_json_extraction.ts b/examples/scrape/scrape_json_extraction.ts new file mode 100644 index 0000000..60430d6 --- /dev/null +++ b/examples/scrape/scrape_json_extraction.ts @@ -0,0 +1,42 @@ +import { ScrapeGraphAI } from "scrapegraph-js"; + +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." }) +const sgai = ScrapeGraphAI(); + +const res = await sgai.scrape({ + url: "https://example.com", + formats: [ + { + type: "json", + prompt: "Extract the company name, tagline, and list of features", + schema: { + type: "object", + properties: { + companyName: { type: "string" }, + tagline: { type: "string" }, + features: { + type: "array", + items: { type: "string" }, + }, + }, + required: ["companyName"], + }, + }, + ], +}); + +if (res.status === "success") { + const json = res.data?.results.json; + + console.log("=== JSON Extraction ===\n"); + console.log("Extracted data:"); + console.log(JSON.stringify(json?.data, null, 2)); + + if (json?.metadata?.chunker) { + console.log("\nChunker info:"); + console.log(" Chunks:", json.metadata.chunker.chunks.length); + console.log(" Total size:", json.metadata.chunker.chunks.reduce((a, c) => a + c.size, 0), "chars"); + } +} else { + console.error("Failed:", res.error); +} diff --git a/examples/scrape/scrape_multi_format.ts b/examples/scrape/scrape_multi_format.ts new file mode 100644 index 0000000..52783db --- /dev/null +++ b/examples/scrape/scrape_multi_format.ts @@ -0,0 
+1,62 @@ +import { ScrapeGraphAI } from "scrapegraph-js"; + +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." }) +const sgai = ScrapeGraphAI(); + +const res = await sgai.scrape({ + url: "https://example.com", + formats: [ + { type: "markdown", mode: "reader" }, + { type: "html", mode: "prune" }, + { type: "links" }, + { type: "images" }, + { type: "summary" }, + { type: "screenshot", fullPage: false, width: 1440, height: 900, quality: 90 }, + ], +}); + +if (res.status === "success") { + const results = res.data?.results; + + console.log("=== Scrape Results ===\n"); + console.log("Provider:", res.data?.metadata.provider); + console.log("Content-Type:", res.data?.metadata.contentType); + console.log("Elapsed:", res.elapsedMs, "ms\n"); + + if (results?.markdown) { + console.log("--- Markdown ---"); + console.log("Length:", results.markdown.data?.join("").length, "chars"); + console.log("Preview:", results.markdown.data?.[0]?.slice(0, 200), "...\n"); + } + + if (results?.html) { + console.log("--- HTML ---"); + console.log("Length:", results.html.data?.join("").length, "chars\n"); + } + + if (results?.links) { + console.log("--- Links ---"); + console.log("Count:", results.links.metadata?.count); + console.log("Sample:", results.links.data?.slice(0, 5), "\n"); + } + + if (results?.images) { + console.log("--- Images ---"); + console.log("Count:", results.images.metadata?.count); + console.log("Sample:", results.images.data?.slice(0, 3), "\n"); + } + + if (results?.summary) { + console.log("--- Summary ---"); + console.log(results.summary.data, "\n"); + } + + if (results?.screenshot) { + console.log("--- Screenshot ---"); + console.log("URL:", results.screenshot.data.url); + console.log("Dimensions:", results.screenshot.data.width, "x", results.screenshot.data.height); + console.log("Format:", results.screenshot.metadata?.contentType, "\n"); + } +} else { + console.error("Failed:", res.error); +} diff --git 
a/examples/scrape/scrape_pdf.ts b/examples/scrape/scrape_pdf.ts new file mode 100644 index 0000000..4a771d9 --- /dev/null +++ b/examples/scrape/scrape_pdf.ts @@ -0,0 +1,35 @@ +import { ScrapeGraphAI } from "scrapegraph-js"; + +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." }) +const sgai = ScrapeGraphAI(); + +const res = await sgai.scrape({ + url: "https://pdfobject.com/pdf/sample.pdf", + contentType: "application/pdf", + formats: [{ type: "markdown" }], +}); + +if (res.status === "success") { + const md = res.data?.results.markdown; + const ocr = res.data?.metadata.ocr; + + console.log("=== PDF Extraction ===\n"); + console.log("Content Type:", res.data?.metadata.contentType); + console.log("OCR Model:", ocr?.model); + console.log("Pages Processed:", ocr?.pagesProcessed); + + if (ocr?.pages) { + for (const page of ocr.pages) { + console.log(`\nPage ${page.index + 1}:`); + console.log(` Dimensions: ${page.dimensions.width}x${page.dimensions.height} @ ${page.dimensions.dpi}dpi`); + console.log(` Images: ${page.images.length}`); + console.log(` Tables: ${page.tables.length}`); + console.log(` Hyperlinks: ${page.hyperlinks.length}`); + } + } + + console.log("\n=== Extracted Markdown ===\n"); + console.log(md?.data?.join("\n\n")); +} else { + console.error("Failed:", res.error); +} diff --git a/examples/scrape/scrape_stealth.ts b/examples/scrape/scrape_stealth.ts deleted file mode 100644 index 9bbf76e..0000000 --- a/examples/scrape/scrape_stealth.ts +++ /dev/null @@ -1,17 +0,0 @@ -import { scrape } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await scrape(apiKey, { - website_url: "https://example.com", - stealth: true, - country_code: "us", -}); - -if (res.status === "success") { - console.log(`HTML length: ${res.data?.html.length} chars`); - console.log("Preview:", res.data?.html.slice(0, 500)); - console.log(`\nTook ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff 
--git a/examples/scrape/scrape_with_branding.ts b/examples/scrape/scrape_with_branding.ts deleted file mode 100644 index 9eac191..0000000 --- a/examples/scrape/scrape_with_branding.ts +++ /dev/null @@ -1,16 +0,0 @@ -import { scrape } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await scrape(apiKey, { - website_url: "https://example.com", - branding: true, -}); - -if (res.status === "success") { - console.log("Branding:", JSON.stringify(res.data?.branding, null, 2)); - console.log(`HTML length: ${res.data?.html.length} chars`); - console.log(`\nTook ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/scrape/scrape_with_fetchconfig.ts b/examples/scrape/scrape_with_fetchconfig.ts new file mode 100644 index 0000000..30fbf49 --- /dev/null +++ b/examples/scrape/scrape_with_fetchconfig.ts @@ -0,0 +1,23 @@ +import { ScrapeGraphAI } from "scrapegraph-js"; + +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." }) +const sgai = ScrapeGraphAI(); + +const res = await sgai.scrape({ + url: "https://example.com", + fetchConfig: { + mode: "js", + stealth: true, + timeout: 45000, + wait: 2000, + scrolls: 3, + }, + formats: [{ type: "markdown" }], +}); + +if (res.status === "success") { + console.log("Content:", res.data?.results.markdown?.data); + console.log("\nProvider:", res.data?.metadata.provider); +} else { + console.error("Failed:", res.error); +} diff --git a/examples/search/search_basic.ts b/examples/search/search_basic.ts new file mode 100644 index 0000000..f224aa8 --- /dev/null +++ b/examples/search/search_basic.ts @@ -0,0 +1,19 @@ +import { ScrapeGraphAI } from "scrapegraph-js"; + +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." 
}) +const sgai = ScrapeGraphAI(); + +const res = await sgai.search({ + query: "best programming languages 2024", + numResults: 3, +}); + +if (res.status === "success") { + for (const result of res.data?.results ?? []) { + console.log(`\n${result.title}`); + console.log(`URL: ${result.url}`); + console.log(`Content: ${result.content.slice(0, 200)}...`); + } +} else { + console.error("Failed:", res.error); +} diff --git a/examples/search/search_with_extraction.ts b/examples/search/search_with_extraction.ts new file mode 100644 index 0000000..967bd5f --- /dev/null +++ b/examples/search/search_with_extraction.ts @@ -0,0 +1,26 @@ +import { ScrapeGraphAI } from "scrapegraph-js"; + +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." }) +const sgai = ScrapeGraphAI(); + +const res = await sgai.search({ + query: "typescript best practices", + numResults: 5, + prompt: "Extract the main tips and recommendations", + schema: { + type: "object", + properties: { + tips: { + type: "array", + items: { type: "string" }, + }, + }, + }, +}); + +if (res.status === "success") { + console.log("Search results:", res.data?.results.length); + console.log("\nExtracted tips:", JSON.stringify(res.data?.json, null, 2)); +} else { + console.error("Failed:", res.error); +} diff --git a/examples/searchscraper/searchscraper_basic.ts b/examples/searchscraper/searchscraper_basic.ts deleted file mode 100644 index 78e56a2..0000000 --- a/examples/searchscraper/searchscraper_basic.ts +++ /dev/null @@ -1,16 +0,0 @@ -import { searchScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await searchScraper(apiKey, { - user_prompt: "What is the latest version of Python and what are its main features?", - num_results: 3, -}); - -if (res.status === "success") { - console.log("Result:", JSON.stringify(res.data?.result, null, 2)); - console.log("\nReference URLs:"); - res.data?.reference_urls.forEach((url, i) => console.log(` ${i + 1}. 
${url}`)); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/searchscraper/searchscraper_markdown.ts b/examples/searchscraper/searchscraper_markdown.ts deleted file mode 100644 index 15f6789..0000000 --- a/examples/searchscraper/searchscraper_markdown.ts +++ /dev/null @@ -1,19 +0,0 @@ -import { searchScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -// extraction_mode: false returns raw markdown instead of AI-extracted data -// costs 2 credits per page vs 10 for AI extraction -const res = await searchScraper(apiKey, { - user_prompt: "Latest developments in artificial intelligence", - num_results: 3, - extraction_mode: false, -}); - -if (res.status === "success") { - console.log("Result:", JSON.stringify(res.data?.result, null, 2)); - console.log("\nReference URLs:"); - res.data?.reference_urls.forEach((url, i) => console.log(` ${i + 1}. ${url}`)); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/searchscraper/searchscraper_with_schema.ts b/examples/searchscraper/searchscraper_with_schema.ts deleted file mode 100644 index 085062d..0000000 --- a/examples/searchscraper/searchscraper_with_schema.ts +++ /dev/null @@ -1,37 +0,0 @@ -import { searchScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const schema = { - type: "object", - properties: { - version: { type: "string" }, - release_date: { type: "string" }, - features: { - type: "array", - items: { - type: "object", - properties: { - name: { type: "string" }, - description: { type: "string" }, - }, - required: ["name", "description"], - }, - }, - }, - required: ["version", "features"], -}; - -const res = await searchScraper(apiKey, { - user_prompt: "What is the latest version of Python and its new features?", - num_results: 5, - output_schema: schema, -}); - -if (res.status === "success") { - console.log("Result:", JSON.stringify(res.data?.result, null, 2)); - console.log("\nReference URLs:"); - 
res.data?.reference_urls.forEach((url, i) => console.log(` ${i + 1}. ${url}`)); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/sitemap/sitemap_basic.ts b/examples/sitemap/sitemap_basic.ts deleted file mode 100644 index a1ffdd4..0000000 --- a/examples/sitemap/sitemap_basic.ts +++ /dev/null @@ -1,16 +0,0 @@ -import { sitemap } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await sitemap(apiKey, { - website_url: "https://scrapegraphai.com", -}); - -if (res.status === "success") { - const urls = res.data?.urls ?? []; - console.log(`Found ${urls.length} URLs:\n`); - urls.slice(0, 20).forEach((url, i) => console.log(` ${i + 1}. ${url}`)); - if (urls.length > 20) console.log(` ... and ${urls.length - 20} more`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/sitemap/sitemap_with_smartscraper.ts b/examples/sitemap/sitemap_with_smartscraper.ts deleted file mode 100644 index 6c4a965..0000000 --- a/examples/sitemap/sitemap_with_smartscraper.ts +++ /dev/null @@ -1,30 +0,0 @@ -import { sitemap, smartScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const sitemapRes = await sitemap(apiKey, { - website_url: "https://scrapegraphai.com", -}); - -if (sitemapRes.status !== "success") { - console.error("Sitemap failed:", sitemapRes.error); - process.exit(1); -} - -const urls = sitemapRes.data?.urls ?? 
[]; -console.log(`Found ${urls.length} URLs, scraping first 3...\n`); - -for (const url of urls.slice(0, 3)) { - console.log(`Scraping: ${url}`); - const res = await smartScraper(apiKey, { - user_prompt: "Extract the page title and main content summary", - website_url: url, - }); - - if (res.status === "success") { - console.log(" Result:", JSON.stringify(res.data?.result, null, 2)); - } else { - console.error(" Failed:", res.error); - } - console.log(); -} diff --git a/examples/smartscraper/smartscraper_basic.ts b/examples/smartscraper/smartscraper_basic.ts deleted file mode 100644 index 90dda7f..0000000 --- a/examples/smartscraper/smartscraper_basic.ts +++ /dev/null @@ -1,15 +0,0 @@ -import { smartScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await smartScraper(apiKey, { - user_prompt: "What does the company do? Extract the main heading and description", - website_url: "https://scrapegraphai.com", -}); - -if (res.status === "success") { - console.log("Result:", JSON.stringify(res.data?.result, null, 2)); - console.log(`Took ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/smartscraper/smartscraper_cookies.ts b/examples/smartscraper/smartscraper_cookies.ts deleted file mode 100644 index 9674fd8..0000000 --- a/examples/smartscraper/smartscraper_cookies.ts +++ /dev/null @@ -1,16 +0,0 @@ -import { smartScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await smartScraper(apiKey, { - user_prompt: "Extract all cookies info", - website_url: "https://httpbin.org/cookies", - cookies: { session_id: "abc123", user_token: "xyz789" }, -}); - -if (res.status === "success") { - console.log("Cookies:", JSON.stringify(res.data?.result, null, 2)); - console.log(`Took ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/smartscraper/smartscraper_html.ts b/examples/smartscraper/smartscraper_html.ts deleted 
file mode 100644 index b0cfed7..0000000 --- a/examples/smartscraper/smartscraper_html.ts +++ /dev/null @@ -1,47 +0,0 @@ -import { smartScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const html = ` - - - -
-<div class="product">
-  <h2 class="name">Laptop Pro 15</h2>
-  <span class="brand">TechCorp</span>
-  <span class="price">$1,299.99</span>
-  <span class="rating">4.5/5</span>
-  <span class="stock">In Stock</span>
-  <p class="description">High-performance laptop with 15-inch display, 16GB RAM, and 512GB SSD</p>
-</div>
-<div class="product">
-  <h2 class="name">Wireless Mouse Elite</h2>
-  <span class="brand">PeripheralCo</span>
-  <span class="price">$29.99</span>
-  <span class="rating">4.8/5</span>
-  <span class="stock">In Stock</span>
-  <p class="description">Ergonomic wireless mouse with precision tracking</p>
-</div>
-<div class="product">
-  <h2 class="name">USB-C Hub Pro</h2>
-  <span class="brand">ConnectTech</span>
-  <span class="price">$49.99</span>
-  <span class="rating">4.3/5</span>
-  <span class="stock">Out of Stock</span>
-  <p class="description">7-in-1 USB-C hub with HDMI, USB 3.0, and SD card reader</p>
-</div>
- - -`; - -const res = await smartScraper(apiKey, { - user_prompt: "Extract all products with name, brand, price, rating, and stock status", - website_html: html, -}); - -if (res.status === "success") { - console.log("Products:", JSON.stringify(res.data?.result, null, 2)); - console.log(`Took ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/smartscraper/smartscraper_infinite_scroll.ts b/examples/smartscraper/smartscraper_infinite_scroll.ts deleted file mode 100644 index 3e7e008..0000000 --- a/examples/smartscraper/smartscraper_infinite_scroll.ts +++ /dev/null @@ -1,16 +0,0 @@ -import { smartScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await smartScraper(apiKey, { - user_prompt: "Extract all post titles and authors", - website_url: "https://news.ycombinator.com", - number_of_scrolls: 5, -}); - -if (res.status === "success") { - console.log("Posts:", JSON.stringify(res.data?.result, null, 2)); - console.log(`Took ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/smartscraper/smartscraper_markdown.ts b/examples/smartscraper/smartscraper_markdown.ts deleted file mode 100644 index 1fbacc3..0000000 --- a/examples/smartscraper/smartscraper_markdown.ts +++ /dev/null @@ -1,40 +0,0 @@ -import { smartScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const markdown = ` -# Product Catalog - -## Laptop Pro 15 -- **Brand**: TechCorp -- **Price**: $1,299.99 -- **Rating**: 4.5/5 -- **In Stock**: Yes -- **Description**: High-performance laptop with 15-inch display, 16GB RAM, and 512GB SSD - -## Wireless Mouse Elite -- **Brand**: PeripheralCo -- **Price**: $29.99 -- **Rating**: 4.8/5 -- **In Stock**: Yes -- **Description**: Ergonomic wireless mouse with precision tracking - -## USB-C Hub Pro -- **Brand**: ConnectTech -- **Price**: $49.99 -- **Rating**: 4.3/5 -- **In Stock**: No -- **Description**: 7-in-1 USB-C hub 
with HDMI, USB 3.0, and SD card reader -`; - -const res = await smartScraper(apiKey, { - user_prompt: "Extract all products with name, brand, price, rating, and stock status", - website_markdown: markdown, -}); - -if (res.status === "success") { - console.log("Products:", JSON.stringify(res.data?.result, null, 2)); - console.log(`Took ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/smartscraper/smartscraper_pagination.ts b/examples/smartscraper/smartscraper_pagination.ts deleted file mode 100644 index 93aa792..0000000 --- a/examples/smartscraper/smartscraper_pagination.ts +++ /dev/null @@ -1,16 +0,0 @@ -import { smartScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await smartScraper(apiKey, { - user_prompt: "Extract all product info including name, price, rating, and image_url", - website_url: "https://www.amazon.in/s?k=tv", - total_pages: 3, -}); - -if (res.status === "success") { - console.log("Products:", JSON.stringify(res.data?.result, null, 2)); - console.log(`Took ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/smartscraper/smartscraper_stealth.ts b/examples/smartscraper/smartscraper_stealth.ts deleted file mode 100644 index 48dd2da..0000000 --- a/examples/smartscraper/smartscraper_stealth.ts +++ /dev/null @@ -1,19 +0,0 @@ -import { smartScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const res = await smartScraper(apiKey, { - user_prompt: "Extract the main content and headings", - website_url: "https://example.com", - stealth: true, - headers: { - "Accept-Language": "en-US,en;q=0.9", - }, -}); - -if (res.status === "success") { - console.log("Result:", JSON.stringify(res.data?.result, null, 2)); - console.log(`Took ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/smartscraper/smartscraper_with_schema.ts 
b/examples/smartscraper/smartscraper_with_schema.ts deleted file mode 100644 index d9ca09a..0000000 --- a/examples/smartscraper/smartscraper_with_schema.ts +++ /dev/null @@ -1,36 +0,0 @@ -import { smartScraper } from "scrapegraph-js"; - -const apiKey = process.env.SGAI_API_KEY!; - -const schema = { - type: "object", - properties: { - products: { - type: "array", - items: { - type: "object", - properties: { - name: { type: "string" }, - price: { type: "number" }, - rating: { type: "string" }, - image_url: { type: "string", format: "uri" }, - }, - required: ["name", "price"], - }, - }, - }, - required: ["products"], -}; - -const res = await smartScraper(apiKey, { - user_prompt: "Extract all product info including name, price, rating, and image_url", - website_url: "https://www.amazon.in/s?k=laptop", - output_schema: schema, -}); - -if (res.status === "success") { - console.log("Products:", JSON.stringify(res.data?.result, null, 2)); - console.log(`Took ${res.elapsedMs}ms`); -} else { - console.error("Failed:", res.error); -} diff --git a/examples/utilities/credits.ts b/examples/utilities/credits.ts index 0815236..bef2949 100644 --- a/examples/utilities/credits.ts +++ b/examples/utilities/credits.ts @@ -1,12 +1,17 @@ -import { getCredits } from "scrapegraph-js"; +import { ScrapeGraphAI } from "scrapegraph-js"; -const apiKey = process.env.SGAI_API_KEY!; +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." 
}) +const sgai = ScrapeGraphAI(); -const res = await getCredits(apiKey); +const res = await sgai.credits(); if (res.status === "success") { - console.log("Remaining credits:", res.data?.remaining_credits); - console.log("Total credits used:", res.data?.total_credits_used); + console.log("Plan:", res.data?.plan); + console.log("Remaining credits:", res.data?.remaining); + console.log("Used credits:", res.data?.used); + console.log("\nJob limits:"); + console.log(" Crawl:", res.data?.jobs.crawl.used, "/", res.data?.jobs.crawl.limit); + console.log(" Monitor:", res.data?.jobs.monitor.used, "/", res.data?.jobs.monitor.limit); } else { console.error("Failed:", res.error); } diff --git a/examples/utilities/health.ts b/examples/utilities/health.ts index 8e17af0..c68a293 100644 --- a/examples/utilities/health.ts +++ b/examples/utilities/health.ts @@ -1,8 +1,9 @@ -import { checkHealth } from "scrapegraph-js"; +import { ScrapeGraphAI } from "scrapegraph-js"; -const apiKey = process.env.SGAI_API_KEY!; +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." }) +const sgai = ScrapeGraphAI(); -const res = await checkHealth(apiKey); +const res = await sgai.healthy(); if (res.status === "success") { console.log("API Status:", res.data?.status); diff --git a/examples/utilities/history.ts b/examples/utilities/history.ts index 89244f4..f6cb220 100644 --- a/examples/utilities/history.ts +++ b/examples/utilities/history.ts @@ -1,20 +1,18 @@ -import { history, HISTORY_SERVICES } from "scrapegraph-js"; +import { ScrapeGraphAI } from "scrapegraph-js"; -const apiKey = process.env.SGAI_API_KEY!; +// reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI({ apiKey: "..." 
}) +const sgai = ScrapeGraphAI(); -console.log("Available services:", HISTORY_SERVICES.join(", ")); - -const res = await history(apiKey, { - service: "smartscraper", - page: 1, - page_size: 5, +const res = await sgai.history.list({ + service: "scrape", + limit: 5, }); if (res.status === "success") { - console.log(`\nTotal requests: ${res.data?.total_count}`); - console.log(`Page ${res.data?.page} of ${Math.ceil((res.data?.total_count ?? 0) / (res.data?.page_size ?? 10))}\n`); - for (const entry of res.data?.requests ?? []) { - console.log(` [${entry.status}] ${entry.request_id}`); + console.log(`Total: ${res.data?.pagination.total}`); + console.log(`Page ${res.data?.pagination.page}\n`); + for (const entry of res.data?.data ?? []) { + console.log(` [${entry.status}] ${entry.service} - ${entry.id}`); } } else { console.error("Failed:", res.error); diff --git a/integration_test.ts b/integration_test.ts deleted file mode 100644 index 10482eb..0000000 --- a/integration_test.ts +++ /dev/null @@ -1,130 +0,0 @@ -import { - type CreditsResponse, - type HealthResponse, - type MarkdownifyResponse, - type ScrapeResponse, - type SearchScraperResponse, - type SitemapResponse, - type SmartScraperResponse, - checkHealth, - getCredits, - markdownify, - scrape, - searchScraper, - sitemap, - smartScraper, -} from "./src/index.js"; - -const maybeKey = process.env.SGAI_API_KEY; -if (!maybeKey) { - console.error("Set SGAI_API_KEY env var"); - process.exit(1); -} -const apiKey: string = maybeKey; - -function assert(condition: boolean, msg: string) { - if (!condition) { - console.error(`FAIL: ${msg}`); - process.exit(1); - } -} - -function logResult(name: string, data: unknown) { - console.log(`\n=== ${name} ===`); - console.log(JSON.stringify(data, null, 2)); -} - -async function testHealth() { - const res = await checkHealth(apiKey); - logResult("checkHealth", res); - assert(res.status === "success", "health status should be success"); - const d = res.data as HealthResponse; - 
assert(typeof d.status === "string", "health.status should be string"); -} - -async function testCredits() { - const res = await getCredits(apiKey); - logResult("getCredits", res); - assert(res.status === "success", "credits status should be success"); - const d = res.data as CreditsResponse; - assert(typeof d.remaining_credits === "number", "remaining_credits should be number"); - assert(typeof d.total_credits_used === "number", "total_credits_used should be number"); -} - -async function testSmartScraper() { - const res = await smartScraper(apiKey, { - user_prompt: "Extract the page title and description", - website_url: "https://example.com", - }); - logResult("smartScraper", res); - assert(res.status === "success", "smartScraper status should be success"); - const d = res.data as SmartScraperResponse; - assert(typeof d.request_id === "string", "request_id should be string"); - assert(typeof d.status === "string", "status should be string"); - assert(typeof d.website_url === "string", "website_url should be string"); - assert(typeof d.user_prompt === "string", "user_prompt should be string"); - assert(d.result !== undefined, "result should exist"); -} - -async function testSearchScraper() { - const res = await searchScraper(apiKey, { - user_prompt: "What is the capital of France?", - }); - logResult("searchScraper", res); - assert(res.status === "success", "searchScraper status should be success"); - const d = res.data as SearchScraperResponse; - assert(typeof d.request_id === "string", "request_id should be string"); - assert(typeof d.user_prompt === "string", "user_prompt should be string"); - assert(Array.isArray(d.reference_urls), "reference_urls should be array"); - assert( - d.result !== undefined || d.markdown_content !== undefined, - "result or markdown_content should exist", - ); -} - -async function testMarkdownify() { - const res = await markdownify(apiKey, { - website_url: "https://example.com", - }); - logResult("markdownify", res); - 
assert(res.status === "success", "markdownify status should be success"); - const d = res.data as MarkdownifyResponse; - assert(typeof d.request_id === "string", "request_id should be string"); - assert(typeof d.website_url === "string", "website_url should be string"); - assert(typeof d.result === "string" || d.result === null, "result should be string or null"); -} - -async function testScrape() { - const res = await scrape(apiKey, { - website_url: "https://example.com", - }); - logResult("scrape", res); - assert(res.status === "success", "scrape status should be success"); - const d = res.data as ScrapeResponse; - assert(typeof d.scrape_request_id === "string", "scrape_request_id should be string"); - assert(typeof d.html === "string", "html should be string"); - assert(typeof d.status === "string", "status should be string"); -} - -async function testSitemap() { - const res = await sitemap(apiKey, { - website_url: "https://scrapegraphai.com", - }); - logResult("sitemap", res); - assert(res.status === "success", "sitemap status should be success"); - const d = res.data as SitemapResponse; - assert(typeof d.request_id === "string", "request_id should be string"); - assert(Array.isArray(d.urls), "urls should be array"); -} - -console.log("Running API battle tests...\n"); - -await testHealth(); -await testCredits(); -await testSmartScraper(); -await testSearchScraper(); -await testMarkdownify(); -await testScrape(); -await testSitemap(); - -console.log("\nAll tests passed."); diff --git a/media/banner.png b/media/banner.png new file mode 100644 index 0000000..8b06be5 Binary files /dev/null and b/media/banner.png differ diff --git a/package.json b/package.json index 0dc8b8c..f5cc9c1 100644 --- a/package.json +++ b/package.json @@ -16,8 +16,8 @@ "build": "tsup", "lint": "biome check .", "format": "biome format . 
--write", - "test": "bun test tests/", - "test:integration": "bun run integration_test.ts", + "test": "bun test tests/*.test.ts", + "test:integration": "bun test tests/*.spec.ts", "check": "tsc --noEmit && biome check .", "prepublishOnly": "tsup" }, @@ -39,5 +39,8 @@ "@types/node": "^22.13.1", "tsup": "^8.3.6", "typescript": "^5.8.2" + }, + "dependencies": { + "zod": "^4.3.6" } } diff --git a/src/index.ts b/src/index.ts index cb73196..f55a044 100644 --- a/src/index.ts +++ b/src/index.ts @@ -1,41 +1,63 @@ export { - smartScraper, - searchScraper, - markdownify, + ScrapeGraphAI, + type ScrapeGraphAIClient, + type ScrapeGraphAIInput, scrape, - crawl, - agenticScraper, - generateSchema, - sitemap, + extract, + search, getCredits, checkHealth, history, + crawl, + monitor, } from "./scrapegraphai.js"; export type { - AgenticScraperParams, - AgenticScraperResponse, + ApiFetchConfig, + ApiFetchContentType, + ApiHtmlMode, + ApiScrapeFormatEntry, + ApiScrapeRequest, + ApiScrapeResponse, + ApiScrapeFormat, + ApiScrapeResultMap, + ApiExtractRequest, + ApiExtractResponse, + ApiSearchRequest, + ApiSearchResponse, + ApiSearchResult, + ApiCrawlRequest, + ApiCrawlResponse, + ApiCrawlResult, + ApiCrawlPage, + ApiCrawlStatus, + ApiCrawlPageStatus, + ApiMonitorCreateInput, + ApiMonitorUpdateInput, + ApiMonitorResponse, + ApiMonitorResult, + ApiMonitorDiffs, + ApiHistoryFilter, + ApiHistoryEntry, + ApiHistoryPage, + ApiHistoryService, + ApiHistoryStatus, + ApiCreditsResponse, + ApiHealthResponse, ApiResult, - CrawlParams, - CrawlPage, - CrawlResponse, - CreditsResponse, - GenerateSchemaParams, - GenerateSchemaResponse, - HealthResponse, - HistoryEntry, - HistoryParams, - HistoryResponse, - MarkdownifyParams, - MarkdownifyResponse, - ScrapeParams, - ScrapeResponse, - SearchScraperParams, - SearchScraperResponse, - SitemapParams, - SitemapResponse, - SmartScraperParams, - SmartScraperResponse, -} from "./types/index.js"; + ApiTokenUsage, + ApiChunkerMetadata, + ApiBranding, +} from 
"./types.js"; -export { HISTORY_SERVICES } from "./types/index.js"; +export { + apiScrapeRequestSchema, + apiExtractRequestBaseSchema, + apiSearchRequestSchema, + apiCrawlRequestSchema, + apiMonitorCreateSchema, + apiMonitorUpdateSchema, + apiHistoryFilterSchema, + apiFetchConfigSchema, + apiScrapeFormatEntrySchema, +} from "./schemas.js"; diff --git a/src/schemas.ts b/src/schemas.ts new file mode 100644 index 0000000..dd8e2ab --- /dev/null +++ b/src/schemas.ts @@ -0,0 +1,268 @@ +import { z } from "zod"; + +export const apiServiceEnumSchema = z.enum(["scrape", "extract", "search", "monitor", "crawl"]); + +export const apiStatusEnumSchema = z.enum(["completed", "failed"]); + +export const apiHtmlModeSchema = z.enum(["normal", "reader", "prune"]); + +export const apiFetchContentTypeSchema = z.enum([ + "text/html", + "application/pdf", + "application/vnd.openxmlformats-officedocument.wordprocessingml.document", + "application/vnd.openxmlformats-officedocument.presentationml.presentation", + "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", + "image/jpeg", + "image/png", + "image/webp", + "image/gif", + "image/avif", + "image/tiff", + "image/heic", + "image/bmp", + "application/epub+zip", + "application/rtf", + "application/vnd.oasis.opendocument.text", + "text/csv", + "text/plain", + "application/x-latex", +]); + +export const apiUserPromptSchema = z.string().min(1).max(10_000); + +export const apiUrlSchema = z.string().url(); + +export const apiPaginationSchema = z.object({ + page: z.coerce.number().int().positive().default(1), + limit: z.coerce.number().int().positive().max(100).default(20), +}); + +export const apiUuidParamSchema = z.object({ + id: z.string().regex(/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i), +}); + +export const apiFetchModeSchema = z.enum(["auto", "fast", "js"]); + +export const FETCH_CONFIG_DEFAULTS = { + mode: "auto", + stealth: false, + timeout: 30000, + wait: 0, + scrolls: 0, +} as const; + +export 
const apiFetchConfigSchema = z.object({ + mode: apiFetchModeSchema.default(FETCH_CONFIG_DEFAULTS.mode), + stealth: z.boolean().default(FETCH_CONFIG_DEFAULTS.stealth), + timeout: z.number().int().min(1000).max(60000).default(FETCH_CONFIG_DEFAULTS.timeout), + wait: z.number().int().min(0).max(30000).default(FETCH_CONFIG_DEFAULTS.wait), + headers: z.record(z.string(), z.string()).optional(), + cookies: z.record(z.string(), z.string()).optional(), + country: z + .string() + .length(2) + .transform((v) => v.toLowerCase()) + .optional(), + scrolls: z.number().int().min(0).max(100).default(FETCH_CONFIG_DEFAULTS.scrolls), + mock: z + .union([ + z.boolean(), + z.object({ + minKb: z.number().int().min(1).max(1000).default(1), + maxKb: z.number().int().min(1).max(1000).default(5), + minSleep: z.number().int().min(0).max(30000).default(5), + maxSleep: z.number().int().min(0).max(30000).default(15), + writeToBucket: z.boolean().default(false), + }), + ]) + .default(false), +}); + +export const apiHistoryFilterSchema = z.object({ + page: z.coerce.number().int().positive().default(1), + limit: z.coerce.number().int().min(1).max(100).default(20), + service: apiServiceEnumSchema.optional(), +}); + +export const apiScrapeContentFormatSchema = z.enum([ + "markdown", + "html", + "links", + "images", + "summary", + "json", + "branding", +]); + +export const apiScrapeCaptureFormatSchema = z.enum(["screenshot"]); + +export const apiScrapeFormatSchema = z.enum([ + ...apiScrapeContentFormatSchema.options, + ...apiScrapeCaptureFormatSchema.options, +]); + +export const apiMarkdownConfigSchema = z.object({ + mode: apiHtmlModeSchema.default("normal"), +}); + +export const apiHtmlConfigSchema = z.object({ + mode: apiHtmlModeSchema.default("normal"), +}); + +export const apiScreenshotConfigSchema = z.object({ + fullPage: z.boolean().default(false), + width: z.number().int().min(320).max(3840).default(1440), + height: z.number().int().min(200).max(2160).default(900), + quality: 
z.number().int().min(1).max(100).default(80), +}); + +export const apiScrapeJsonConfigSchema = z.object({ + prompt: apiUserPromptSchema, + schema: z.record(z.string(), z.unknown()).optional(), + mode: apiHtmlModeSchema.default("normal"), +}); + +export const apiScrapeSummaryConfigSchema = z.object({}); + +export const apiScrapeMarkdownFormatSchema = apiMarkdownConfigSchema.extend({ + type: z.literal("markdown"), +}); + +export const apiScrapeHtmlFormatSchema = apiHtmlConfigSchema.extend({ + type: z.literal("html"), +}); + +export const apiScrapeScreenshotFormatSchema = apiScreenshotConfigSchema.extend({ + type: z.literal("screenshot"), +}); + +export const apiScrapeJsonFormatSchema = apiScrapeJsonConfigSchema.extend({ + type: z.literal("json"), +}); + +export const apiScrapeLinksFormatSchema = z.object({ + type: z.literal("links"), +}); + +export const apiScrapeImagesFormatSchema = z.object({ + type: z.literal("images"), +}); + +export const apiScrapeSummaryFormatSchema = apiScrapeSummaryConfigSchema.extend({ + type: z.literal("summary"), +}); + +export const apiScrapeBrandingFormatSchema = z.object({ + type: z.literal("branding"), +}); + +export const apiScrapeFormatEntrySchema = z.discriminatedUnion("type", [ + apiScrapeMarkdownFormatSchema, + apiScrapeHtmlFormatSchema, + apiScrapeScreenshotFormatSchema, + apiScrapeJsonFormatSchema, + apiScrapeLinksFormatSchema, + apiScrapeImagesFormatSchema, + apiScrapeSummaryFormatSchema, + apiScrapeBrandingFormatSchema, +]); + +export const apiScrapeRequestSchema = z.object({ + url: apiUrlSchema, + contentType: apiFetchContentTypeSchema.optional(), + fetchConfig: apiFetchConfigSchema.optional(), + formats: z + .array(apiScrapeFormatEntrySchema) + .min(1) + .refine((formats) => new Set(formats.map((format) => format.type)).size === formats.length, { + message: "duplicate format types not allowed", + }) + .default([{ type: "markdown", mode: "normal" }]), +}); + +export const apiExtractRequestBaseSchema = z + .object({ + url: 
apiUrlSchema.optional(), + html: z.string().optional(), + markdown: z.string().optional(), + mode: apiHtmlModeSchema.default("normal"), + prompt: apiUserPromptSchema, + schema: z.record(z.string(), z.unknown()).optional(), + contentType: apiFetchContentTypeSchema.optional(), + fetchConfig: apiFetchConfigSchema.optional(), + }) + .refine((d) => d.url || d.html || d.markdown, { + message: "Either url, html, or markdown is required", + }); + +export const apiSearchRequestSchema = z + .object({ + query: z.string().min(1).max(500), + numResults: z.number().int().min(1).max(20).default(3), + format: z.enum(["html", "markdown"]).default("markdown"), + mode: apiHtmlModeSchema.default("prune"), + fetchConfig: apiFetchConfigSchema.optional(), + prompt: apiUserPromptSchema.optional(), + schema: z.record(z.string(), z.unknown()).optional(), + locationGeoCode: z.string().max(10).optional(), + timeRange: z + .enum(["past_hour", "past_24_hours", "past_week", "past_month", "past_year"]) + .optional(), + }) + .refine((d) => !d.schema || d.prompt, { + message: "schema requires prompt", + }); + +export const apiMonitorCreateSchema = z.object({ + url: apiUrlSchema, + name: z.string().max(200).optional(), + formats: z + .array(apiScrapeFormatEntrySchema) + .min(1) + .refine((formats) => new Set(formats.map((f) => f.type)).size === formats.length, { + message: "duplicate format types not allowed", + }) + .default([{ type: "markdown", mode: "normal" }]), + webhookUrl: apiUrlSchema.optional(), + interval: z.string().min(1).max(100), + fetchConfig: apiFetchConfigSchema.optional(), +}); + +export const apiMonitorUpdateSchema = z + .object({ + name: z.string().max(200).optional(), + formats: z + .array(apiScrapeFormatEntrySchema) + .min(1) + .refine((formats) => new Set(formats.map((f) => f.type)).size === formats.length, { + message: "duplicate format types not allowed", + }) + .optional(), + webhookUrl: apiUrlSchema.nullable().optional(), + interval: z.string().min(1).max(100).optional(), 
+ fetchConfig: apiFetchConfigSchema.optional(), + }) + .partial(); + +export const apiCrawlStatusSchema = z.enum(["running", "completed", "failed", "paused", "deleted"]); + +export const apiCrawlPageStatusSchema = z.enum(["completed", "failed", "skipped"]); + +export const apiCrawlRequestSchema = z.object({ + url: apiUrlSchema, + formats: z + .array(apiScrapeFormatEntrySchema) + .min(1) + .refine((formats) => new Set(formats.map((f) => f.type)).size === formats.length, { + message: "duplicate format types not allowed", + }) + .default([{ type: "markdown", mode: "normal" }]), + maxDepth: z.coerce.number().int().min(0).default(2), + maxPages: z.coerce.number().int().min(1).max(1000).default(50), + maxLinksPerPage: z.coerce.number().int().min(1).default(10), + allowExternal: z.boolean().default(false), + includePatterns: z.array(z.string()).optional(), + excludePatterns: z.array(z.string()).optional(), + contentTypes: z.array(apiFetchContentTypeSchema).optional(), + fetchConfig: apiFetchConfigSchema.optional(), +}); diff --git a/src/scrapegraphai.ts b/src/scrapegraphai.ts index b91fa3b..7225154 100644 --- a/src/scrapegraphai.ts +++ b/src/scrapegraphai.ts @@ -1,33 +1,28 @@ import { env } from "./env.js"; import type { - AgenticScraperParams, - AgenticScraperResponse, + ApiCrawlRequest, + ApiCrawlResponse, + ApiCreditsResponse, + ApiExtractRequest, + ApiExtractResponse, + ApiHealthResponse, + ApiHistoryEntry, + ApiHistoryFilter, + ApiHistoryPage, + ApiMonitorCreateInput, + ApiMonitorResponse, + ApiMonitorUpdateInput, ApiResult, - CrawlParams, - CrawlResponse, - CreditsResponse, - GenerateSchemaParams, - GenerateSchemaResponse, - HealthResponse, - HistoryParams, - HistoryResponse, - MarkdownifyParams, - MarkdownifyResponse, - ScrapeParams, - ScrapeResponse, - SearchScraperParams, - SearchScraperResponse, - SitemapParams, - SitemapResponse, - SmartScraperParams, - SmartScraperResponse, -} from "./types/index.js"; - -const BASE_URL = process.env.SGAI_API_URL || 
"https://api.scrapegraphai.com/v1"; + ApiScrapeRequest, + ApiScrapeResponse, + ApiSearchRequest, + ApiSearchResponse, +} from "./types.js"; + +const BASE_URL = process.env.SGAI_API_URL || "https://api.scrapegraphai.com/v2"; const HEALTH_URL = process.env.SGAI_API_URL ? `${process.env.SGAI_API_URL.replace(/\/v\d+$/, "")}` : "https://api.scrapegraphai.com"; -const POLL_INTERVAL_MS = 3000; function debug(label: string, data?: unknown) { if (!env.debug) return; @@ -65,10 +60,16 @@ function mapHttpError(status: number): string { } } +function parseServerTiming(header: string | null): number | null { + if (!header) return null; + const match = header.match(/dur=(\d+(?:\.\d+)?)/); + return match ? Math.round(Number.parseFloat(match[1])) : null; +} + type RequestResult = { data: T; elapsedMs: number }; async function request( - method: "GET" | "POST", + method: "GET" | "POST" | "PATCH" | "DELETE", path: string, apiKey: string, body?: object, @@ -102,97 +103,32 @@ async function request( } const data = (await res.json()) as T; - const elapsedMs = Math.round(performance.now() - start); + const serverTiming = parseServerTiming(res.headers.get("Server-Timing")); + const elapsedMs = serverTiming ?? 
Math.round(performance.now() - start); debug(`← ${res.status} (${elapsedMs}ms)`, data); return { data, elapsedMs }; } -type PollResponse = { - status: string; - error?: string; - [key: string]: unknown; -}; - -function isDone(status: string) { - return status === "completed" || status === "done" || status === "success"; -} - -async function pollUntilDone( - path: string, - id: string, - apiKey: string, - onPoll?: (status: string) => void, -): Promise> { - const deadline = Date.now() + env.timeoutS * 1000; - let totalMs = 0; - - while (Date.now() < deadline) { - const { data, elapsedMs } = await request("GET", `${path}/${id}`, apiKey); - totalMs += elapsedMs; - onPoll?.(data.status); - - if (isDone(data.status)) return { data, elapsedMs: totalMs }; - if (data.status === "failed") throw new Error(data.error ?? "Job failed"); - - await new Promise((r) => setTimeout(r, POLL_INTERVAL_MS)); - } - - throw new Error("Polling timed out"); -} - -function unwrapResult(data: PollResponse): PollResponse { - if (data.result && typeof data.result === "object" && !Array.isArray(data.result)) { - const inner = data.result as Record; - if (inner.status || inner.pages || inner.crawled_urls) { - return { ...inner, status: String(inner.status ?? 
data.status) } as PollResponse; - } - } - return data; -} - -async function submitAndPoll( - path: string, - apiKey: string, - body: object, - idField: string, - onPoll?: (status: string) => void, -): Promise> { - const { data: res, elapsedMs } = await request("POST", path, apiKey, body); - if (isDone(res.status)) return { data: unwrapResult(res) as unknown as T, elapsedMs }; - const id = res[idField]; - if (typeof id !== "string") throw new Error(`Missing ${idField} in response`); - const poll = await pollUntilDone(path, id, apiKey, onPoll); - return { - data: unwrapResult(poll.data) as unknown as T, - elapsedMs: elapsedMs + poll.elapsedMs, - }; -} - -export async function smartScraper( +export async function scrape( apiKey: string, - params: SmartScraperParams, -): Promise> { + params: ApiScrapeRequest, +): Promise> { try { - const { data, elapsedMs } = await request( - "POST", - "/smartscraper", - apiKey, - params, - ); + const { data, elapsedMs } = await request("POST", "/scrape", apiKey, params); return ok(data, elapsedMs); } catch (err) { return fail(err); } } -export async function searchScraper( +export async function extract( apiKey: string, - params: SearchScraperParams, -): Promise> { + params: ApiExtractRequest, +): Promise> { try { - const { data, elapsedMs } = await request( + const { data, elapsedMs } = await request( "POST", - "/searchscraper", + "/extract", apiKey, params, ); @@ -202,47 +138,35 @@ export async function searchScraper( } } -export async function markdownify( +export async function search( apiKey: string, - params: MarkdownifyParams, -): Promise> { + params: ApiSearchRequest, +): Promise> { try { - const { data, elapsedMs } = await request( - "POST", - "/markdownify", - apiKey, - params, - ); + const { data, elapsedMs } = await request("POST", "/search", apiKey, params); return ok(data, elapsedMs); } catch (err) { return fail(err); } } -export async function scrape( - apiKey: string, - params: ScrapeParams, -): Promise> { +export 
async function getCredits(apiKey: string): Promise> { try { - const { data, elapsedMs } = await request("POST", "/scrape", apiKey, params); + const { data, elapsedMs } = await request("GET", "/credits", apiKey); return ok(data, elapsedMs); } catch (err) { return fail(err); } } -export async function crawl( - apiKey: string, - params: CrawlParams, - onPoll?: (status: string) => void, -): Promise> { +export async function checkHealth(apiKey: string): Promise> { try { - const { data, elapsedMs } = await submitAndPoll( - "/crawl", + const { data, elapsedMs } = await request( + "GET", + "/healthz", apiKey, - params, - "task_id", - onPoll, + undefined, + HEALTH_URL, ); return ok(data, elapsedMs); } catch (err) { @@ -250,93 +174,224 @@ export async function crawl( } } -export async function agenticScraper( - apiKey: string, - params: AgenticScraperParams, -): Promise> { - try { - const { data, elapsedMs } = await request( - "POST", - "/agentic-scrapper", - apiKey, - params, - ); - return ok(data, elapsedMs); - } catch (err) { - return fail(err); - } -} +export const history = { + async list(apiKey: string, params?: ApiHistoryFilter): Promise> { + try { + const qs = new URLSearchParams(); + if (params?.page) qs.set("page", String(params.page)); + if (params?.limit) qs.set("limit", String(params.limit)); + if (params?.service) qs.set("service", params.service); + const query = qs.toString(); + const path = query ? 
`/history?${query}` : "/history"; + const { data, elapsedMs } = await request("GET", path, apiKey); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, -export async function generateSchema( - apiKey: string, - params: GenerateSchemaParams, -): Promise> { - try { - const { data, elapsedMs } = await request( - "POST", - "/generate_schema", - apiKey, - params, - ); - return ok(data, elapsedMs); - } catch (err) { - return fail(err); - } -} + async get(apiKey: string, id: string): Promise> { + try { + const { data, elapsedMs } = await request("GET", `/history/${id}`, apiKey); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, +}; -export async function sitemap( - apiKey: string, - params: SitemapParams, -): Promise> { - try { - const { data, elapsedMs } = await request("POST", "/sitemap", apiKey, params); - return ok(data, elapsedMs); - } catch (err) { - return fail(err); - } -} +export const crawl = { + async start(apiKey: string, params: ApiCrawlRequest): Promise> { + try { + const { data, elapsedMs } = await request("POST", "/crawl", apiKey, params); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, -export async function getCredits(apiKey: string): Promise> { - try { - const { data, elapsedMs } = await request("GET", "/credits", apiKey); - return ok(data, elapsedMs); - } catch (err) { - return fail(err); - } + async get(apiKey: string, id: string): Promise> { + try { + const { data, elapsedMs } = await request("GET", `/crawl/${id}`, apiKey); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, + + async stop(apiKey: string, id: string): Promise> { + try { + const { data, elapsedMs } = await request<{ ok: boolean }>( + "POST", + `/crawl/${id}/stop`, + apiKey, + ); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, + + async resume(apiKey: string, id: string): Promise> { + try { + const { data, elapsedMs } = await request<{ ok: boolean }>( + 
"POST", + `/crawl/${id}/resume`, + apiKey, + ); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, + + async delete(apiKey: string, id: string): Promise> { + try { + const { data, elapsedMs } = await request<{ ok: boolean }>("DELETE", `/crawl/${id}`, apiKey); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, +}; + +export const monitor = { + async create( + apiKey: string, + params: ApiMonitorCreateInput, + ): Promise> { + try { + const { data, elapsedMs } = await request( + "POST", + "/monitor", + apiKey, + params, + ); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, + + async list(apiKey: string): Promise> { + try { + const { data, elapsedMs } = await request("GET", "/monitor", apiKey); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, + + async get(apiKey: string, id: string): Promise> { + try { + const { data, elapsedMs } = await request( + "GET", + `/monitor/${id}`, + apiKey, + ); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, + + async update( + apiKey: string, + id: string, + params: ApiMonitorUpdateInput, + ): Promise> { + try { + const { data, elapsedMs } = await request( + "PATCH", + `/monitor/${id}`, + apiKey, + params, + ); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, + + async delete(apiKey: string, id: string): Promise> { + try { + const { data, elapsedMs } = await request<{ ok: boolean }>( + "DELETE", + `/monitor/${id}`, + apiKey, + ); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, + + async pause(apiKey: string, id: string): Promise> { + try { + const { data, elapsedMs } = await request( + "POST", + `/monitor/${id}/pause`, + apiKey, + ); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, + + async resume(apiKey: string, id: string): Promise> { + try { + const { data, elapsedMs } = await request( + "POST", + 
`/monitor/${id}/resume`, + apiKey, + ); + return ok(data, elapsedMs); + } catch (err) { + return fail(err); + } + }, +}; + +export interface ScrapeGraphAIInput { + apiKey?: string; } -export async function checkHealth(apiKey: string): Promise> { - try { - const { data, elapsedMs } = await request( - "GET", - "/healthz", - apiKey, - undefined, - HEALTH_URL, - ); - return ok(data, elapsedMs); - } catch (err) { - return fail(err); - } +function resolveApiKey(opts?: ScrapeGraphAIInput): string { + const key = opts?.apiKey ?? process.env.SGAI_API_KEY; + if (!key) throw new Error("API key required: pass { apiKey } or set SGAI_API_KEY env var"); + return key; } -export async function history( - apiKey: string, - params: HistoryParams, -): Promise> { - try { - const page = params.page ?? 1; - const pageSize = params.page_size ?? 10; - const qs = new URLSearchParams(); - qs.set("page", String(page)); - qs.set("page_size", String(pageSize)); - const { data, elapsedMs } = await request( - "GET", - `/history/${params.service}?${qs}`, - apiKey, - ); - return ok(data, elapsedMs); - } catch (err) { - return fail(err); - } +export function ScrapeGraphAI(opts?: ScrapeGraphAIInput) { + const key = resolveApiKey(opts); + return { + scrape: (params: ApiScrapeRequest) => scrape(key, params), + extract: (params: ApiExtractRequest) => extract(key, params), + search: (params: ApiSearchRequest) => search(key, params), + credits: () => getCredits(key), + healthy: () => checkHealth(key), + history: { + list: (params?: ApiHistoryFilter) => history.list(key, params), + get: (id: string) => history.get(key, id), + }, + crawl: { + start: (params: ApiCrawlRequest) => crawl.start(key, params), + get: (id: string) => crawl.get(key, id), + stop: (id: string) => crawl.stop(key, id), + resume: (id: string) => crawl.resume(key, id), + delete: (id: string) => crawl.delete(key, id), + }, + monitor: { + create: (params: ApiMonitorCreateInput) => monitor.create(key, params), + list: () => 
monitor.list(key), + get: (id: string) => monitor.get(key, id), + update: (id: string, params: ApiMonitorUpdateInput) => monitor.update(key, id, params), + delete: (id: string) => monitor.delete(key, id), + pause: (id: string) => monitor.pause(key, id), + resume: (id: string) => monitor.resume(key, id), + }, + }; } + +export type ScrapeGraphAIClient = ReturnType; diff --git a/src/types.ts b/src/types.ts new file mode 100644 index 0000000..726d243 --- /dev/null +++ b/src/types.ts @@ -0,0 +1,390 @@ +import type { z } from "zod"; +import type { + apiCrawlRequestSchema, + apiExtractRequestBaseSchema, + apiFetchConfigSchema, + apiFetchContentTypeSchema, + apiHistoryFilterSchema, + apiHtmlModeSchema, + apiMonitorCreateSchema, + apiMonitorUpdateSchema, + apiScrapeFormatEntrySchema, + apiScrapeRequestSchema, + apiSearchRequestSchema, +} from "./schemas.js"; + +export type ApiFetchConfig = z.input; +export type ApiFetchContentType = z.infer; +export type ApiHtmlMode = z.infer; +export type ApiScrapeFormatEntry = z.input; + +export type ApiScrapeRequest = z.input; +export type ApiExtractRequest = z.input; +export type ApiSearchRequest = z.input; +export type ApiCrawlRequest = z.input; +export type ApiMonitorCreateInput = z.input; +export type ApiMonitorUpdateInput = z.input; +export type ApiHistoryFilter = z.input; + +export type ApiScrapeFormat = + | "markdown" + | "html" + | "links" + | "images" + | "summary" + | "json" + | "branding" + | "screenshot"; + +export interface ApiTokenUsage { + promptTokens: number; + completionTokens: number; +} + +export interface ApiChunkerMetadata { + chunks: { size: number }[]; +} + +export interface ApiFetchWarning { + reason: "too_short" | "empty" | "bot_blocked" | "spa_shell" | "soft_404"; + provider?: string; +} + +export interface ScrapeMetadata { + provider?: string; + contentType: string; + elapsedMs?: number; + warnings?: ApiFetchWarning[]; + ocr?: { + model: string; + pagesProcessed: number; + pages: ContentPageMetadata[]; + }; +} 
+ +export interface ContentPageMetadata { + index: number; + images: Array<{ + id: string; + topLeftX: number; + topLeftY: number; + bottomRightX: number; + bottomRightY: number; + }>; + tables: Array<{ id: string; content: string; format: string }>; + hyperlinks: string[]; + dimensions: { dpi: number; height: number; width: number }; +} + +export interface ApiBrandingColors { + primary: string; + accent: string; + background: string; + textPrimary: string; + link: string; +} + +export interface ApiBrandingFontEntry { + family: string; + fallback: string; +} + +export interface ApiBrandingTypography { + primary: ApiBrandingFontEntry; + heading: ApiBrandingFontEntry; + mono: ApiBrandingFontEntry; + sizes: { h1: string; h2: string; body: string }; +} + +export interface ApiBrandingImages { + logo: string; + favicon: string; + ogImage: string; +} + +export interface ApiBrandingPersonality { + tone: string; + energy: "high" | "medium" | "low"; + targetAudience: string; +} + +export interface ApiBranding { + colorScheme: "light" | "dark"; + colors: ApiBrandingColors; + typography: ApiBrandingTypography; + images: ApiBrandingImages; + spacing: { baseUnit: number; borderRadius: string }; + frameworkHints: string[]; + personality: ApiBrandingPersonality; + confidence: number; +} + +export interface ApiBrandingMetadata { + title: string; + description: string; + favicon: string; + language: string; + themeColor: string; + ogTitle: string; + ogDescription: string; + ogImage: string; + ogUrl: string; +} + +export interface ApiScrapeScreenshotData { + url: string; + width: number; + height: number; +} + +export interface ApiScrapeFormatError { + code: string; + error: string; +} + +export interface ApiScrapeFormatResponseMap { + markdown: string[]; + html: string[]; + links: string[]; + images: string[]; + summary: string; + json: Record; + branding: ApiBranding; + screenshot: ApiScrapeScreenshotData; +} + +export type ApiImageContentType = Extract; + +export interface 
ApiScrapeFormatMetadataMap { + markdown: Record; + html: Record; + links: { count: number }; + images: { count: number }; + summary: { chunker?: ApiChunkerMetadata }; + json: { chunker: ApiChunkerMetadata; raw?: string | null }; + branding: { branding: ApiBrandingMetadata }; + screenshot: { contentType: ApiImageContentType; provider?: string }; +} + +export type ApiScrapeResultMap = Partial<{ + [K in ApiScrapeFormat]: { + data: ApiScrapeFormatResponseMap[K]; + metadata?: ApiScrapeFormatMetadataMap[K]; + }; +}>; + +export interface ApiScrapeResponse { + results: ApiScrapeResultMap; + metadata: ScrapeMetadata; + errors?: Partial<{ [K in ApiScrapeFormat]: ApiScrapeFormatError }>; +} + +export interface ApiExtractResponse { + raw: string | null; + json: Record | null; + usage: ApiTokenUsage; + metadata: { + chunker: ApiChunkerMetadata; + fetch?: { provider?: string }; + }; +} + +export interface ApiSearchResult { + url: string; + title: string; + content: string; + provider?: string; +} + +export interface ApiSearchMetadata { + search: { provider?: string }; + pages: { requested: number; scraped: number }; + chunker?: ApiChunkerMetadata; +} + +export interface ApiSearchResponse { + results: ApiSearchResult[]; + json?: Record | null; + raw?: string | null; + usage?: ApiTokenUsage; + metadata: ApiSearchMetadata; +} + +export type ApiCrawlStatus = "running" | "completed" | "failed" | "paused" | "deleted"; +export type ApiCrawlPageStatus = "completed" | "failed" | "skipped"; + +export interface ApiCrawlPage { + url: string; + status: ApiCrawlPageStatus; + depth: number; + parentUrl: string | null; + links: string[]; + scrapeRefId: string; + title: string; + contentType: string; + screenshotUrl?: string; + reason?: string; + error?: string; +} + +export interface ApiCrawlResult { + status: ApiCrawlStatus; + reason?: string; + total: number; + finished: number; + pages: ApiCrawlPage[]; +} + +export interface ApiCrawlResponse extends ApiCrawlResult { + id: string; +} + 
+export interface TextChange { + type: "added" | "removed"; + line: number; + content: string; +} + +export interface JsonChange { + path: string; + old: unknown; + new: unknown; +} + +export interface SetChange { + added: string[]; + removed: string[]; +} + +export interface ImageChange { + size: number; + changed: number; + mask?: string; +} + +export interface ApiMonitorDiffs { + markdown?: TextChange[]; + html?: TextChange[]; + json?: JsonChange[]; + screenshot?: ImageChange; + links?: SetChange; + images?: SetChange; + summary?: TextChange[]; + branding?: JsonChange[]; +} + +export type ApiMonitorRefs = Partial>; + +export interface ApiWebhookStatus { + sentAt: string; + statusCode: number | null; + error?: string; +} + +export interface ApiMonitorResult { + changed: boolean; + diffs: ApiMonitorDiffs; + refs: ApiMonitorRefs; + webhookStatus?: ApiWebhookStatus; +} + +export interface ApiMonitorResponse { + cronId: string; + scheduleId: string; + interval: string; + status: "active" | "paused"; + config: ApiMonitorCreateInput; + createdAt: string; + updatedAt: string; +} + +export type ApiHistoryService = "scrape" | "extract" | "search" | "monitor" | "crawl"; +export type ApiHistoryStatus = "completed" | "failed" | "running" | "paused" | "deleted"; + +interface ApiHistoryBase { + id: string; + status: ApiHistoryStatus; + error: unknown; + elapsedMs: number; + createdAt: string; + requestParentId: string | null; +} + +export interface ApiScrapeHistoryEntry extends ApiHistoryBase { + service: "scrape"; + params: ApiScrapeRequest; + result: ApiScrapeResponse; +} + +export interface ApiExtractHistoryEntry extends ApiHistoryBase { + service: "extract"; + params: ApiExtractRequest; + result: ApiExtractResponse; +} + +export interface ApiSearchHistoryEntry extends ApiHistoryBase { + service: "search"; + params: ApiSearchRequest; + result: ApiSearchResponse; +} + +export interface ApiMonitorHistoryEntry extends ApiHistoryBase { + service: "monitor"; + params: { cronId: 
string; url: string }; + result: ApiMonitorResult; +} + +export interface ApiCrawlHistoryEntry extends ApiHistoryBase { + service: "crawl"; + params: { url: string; maxPages: number }; + result: ApiCrawlResult; +} + +export type ApiHistoryEntry = + | ApiScrapeHistoryEntry + | ApiExtractHistoryEntry + | ApiSearchHistoryEntry + | ApiMonitorHistoryEntry + | ApiCrawlHistoryEntry; + +export interface ApiPageResponse { + data: T[]; + pagination: { + page: number; + limit: number; + total: number; + }; +} + +export type ApiHistoryPage = ApiPageResponse; + +export interface ApiJobsStatus { + used: number; + limit: number; +} + +export interface ApiCreditsResponse { + remaining: number; + used: number; + plan: string; + jobs: { + crawl: ApiJobsStatus; + monitor: ApiJobsStatus; + }; +} + +export interface ApiHealthResponse { + status: string; + uptime: number; + services?: { + redis: "ok" | "down"; + db: "ok" | "down"; + }; +} + +export interface ApiResult { + status: "success" | "error"; + data: T | null; + error?: string; + elapsedMs: number; +} diff --git a/src/types/index.ts b/src/types/index.ts deleted file mode 100644 index e6f1360..0000000 --- a/src/types/index.ts +++ /dev/null @@ -1,228 +0,0 @@ -export type SmartScraperParams = { - website_url?: string; - website_html?: string; - website_markdown?: string; - user_prompt: string; - output_schema?: Record; - number_of_scrolls?: number; - total_pages?: number; - stealth?: boolean; - cookies?: Record; - headers?: Record; - plain_text?: boolean; - webhook_url?: string; - mock?: boolean; - steps?: string[]; - wait_ms?: number; - country_code?: string; -}; - -export type SearchScraperParams = { - user_prompt: string; - num_results?: number; - extraction_mode?: boolean; - output_schema?: Record; - stealth?: boolean; - headers?: Record; - webhook_url?: string; - mock?: boolean; - time_range?: "past_hour" | "past_24_hours" | "past_week" | "past_month" | "past_year"; - location_geo_code?: string; -}; - -export type 
MarkdownifyParams = { - website_url: string; - stealth?: boolean; - headers?: Record; - webhook_url?: string; - mock?: boolean; - wait_ms?: number; - country_code?: string; -}; - -type CrawlBase = { - url: string; - max_pages?: number; - depth?: number; - rules?: Record; - sitemap?: boolean; - stealth?: boolean; - webhook_url?: string; - cache_website?: boolean; - breadth?: number; - same_domain_only?: boolean; - batch_size?: number; - wait_ms?: number; - headers?: Record; - number_of_scrolls?: number; - website_html?: string; -}; - -type CrawlExtraction = CrawlBase & { - extraction_mode?: true; - prompt: string; - schema?: Record; -}; - -type CrawlMarkdown = CrawlBase & { - extraction_mode: false; - prompt?: never; - schema?: never; -}; - -export type CrawlParams = CrawlExtraction | CrawlMarkdown; - -export type GenerateSchemaParams = { - user_prompt: string; - existing_schema?: Record; -}; - -export type SitemapParams = { - website_url: string; - headers?: Record; - mock?: boolean; - stealth?: boolean; -}; - -export type ScrapeParams = { - website_url: string; - stealth?: boolean; - branding?: boolean; - country_code?: string; - wait_ms?: number; -}; - -export type AgenticScraperParams = { - url: string; - steps: string[]; - user_prompt?: string; - output_schema?: Record; - ai_extraction?: boolean; - use_session?: boolean; -}; - -export const HISTORY_SERVICES = [ - "markdownify", - "smartscraper", - "searchscraper", - "scrape", - "crawl", - "agentic-scraper", - "sitemap", -] as const; - -export type HistoryParams = { - service: (typeof HISTORY_SERVICES)[number]; - page?: number; - page_size?: number; -}; - -export type ApiResult = { - status: "success" | "error"; - data: T | null; - error?: string; - elapsedMs: number; -}; - -export type SmartScraperResponse = { - request_id: string; - status: string; - website_url: string; - user_prompt: string; - result: Record | null; - error?: string; -}; - -export type SearchScraperResponse = { - request_id: string; - 
status: string; - user_prompt: string; - num_results?: number; - result: Record | null; - markdown_content?: string | null; - reference_urls: string[]; - error?: string | null; -}; - -export type MarkdownifyResponse = { - request_id: string; - status: string; - website_url: string; - result: string | null; - error?: string; -}; - -export type CrawlPage = { - url: string; - markdown: string; -}; - -export type CrawlResponse = { - task_id: string; - status: string; - result?: Record | null; - llm_result?: Record | null; - crawled_urls?: string[]; - pages?: CrawlPage[]; - credits_used?: number; - pages_processed?: number; - elapsed_time?: number; - error?: string; -}; - -export type ScrapeResponse = { - scrape_request_id: string; - status: string; - html: string; - branding?: Record | null; - metadata?: Record | null; - error?: string; -}; - -export type AgenticScraperResponse = { - request_id: string; - status: string; - result: Record | null; - error?: string; -}; - -export type GenerateSchemaResponse = { - request_id: string; - status: string; - user_prompt: string; - refined_prompt?: string | null; - generated_schema?: Record | null; - error?: string | null; - created_at?: string | null; - updated_at?: string | null; -}; - -export type SitemapResponse = { - request_id: string; - urls: string[]; - status?: string; - website_url?: string; - error?: string; -}; - -export type CreditsResponse = { - remaining_credits: number; - total_credits_used: number; -}; - -export type HealthResponse = { - status: string; -}; - -export type HistoryResponse = { - requests: HistoryEntry[]; - total_count: number; - page: number; - page_size: number; -}; - -export type HistoryEntry = { - request_id: string; - status: string; - [key: string]: unknown; -}; diff --git a/tests/integration.spec.ts b/tests/integration.spec.ts new file mode 100644 index 0000000..5380eb9 --- /dev/null +++ b/tests/integration.spec.ts @@ -0,0 +1,106 @@ +import { describe, expect, test } from "bun:test"; +import 
{ crawl, extract, getCredits, history, scrape, search } from "../src/index.js"; + +const API_KEY = process.env.SGAI_API_KEY; +if (!API_KEY) throw new Error("SGAI_API_KEY env var required for integration tests"); + +describe("integration", () => { + test("getCredits", async () => { + const res = await getCredits(API_KEY); + console.log("getCredits:", res); + expect(res.status).toBe("success"); + expect(res.data).toHaveProperty("remaining"); + expect(res.data).toHaveProperty("plan"); + }); + + test("scrape markdown", async () => { + const res = await scrape(API_KEY, { + url: "https://example.com", + formats: [{ type: "markdown" }], + }); + console.log("scrape:", res.status, res.error); + expect(res.status).toBe("success"); + expect(res.data?.results.markdown).toBeDefined(); + }); + + test("scrape with multiple formats", async () => { + const res = await scrape(API_KEY, { + url: "https://example.com", + formats: [{ type: "markdown", mode: "reader" }, { type: "links" }, { type: "images" }], + }); + console.log("scrape multi:", res.status, res.error); + expect(res.status).toBe("success"); + expect(res.data?.results.markdown).toBeDefined(); + expect(res.data?.results.links).toBeDefined(); + }); + + test("scrape PDF document", async () => { + const res = await scrape(API_KEY, { + url: "https://pdfobject.com/pdf/sample.pdf", + contentType: "application/pdf", + formats: [{ type: "markdown" }], + }); + console.log("scrape PDF:", res.status, res.error); + expect(res.status).toBe("success"); + expect(res.data?.metadata.contentType).toBe("application/pdf"); + }); + + test("scrape with fetchConfig", async () => { + const res = await scrape(API_KEY, { + url: "https://example.com", + fetchConfig: { mode: "fast", timeout: 15000 }, + formats: [{ type: "markdown" }], + }); + console.log("scrape fetchConfig:", res.status, res.error); + expect(res.status).toBe("success"); + }); + + test("extract", async () => { + const res = await extract(API_KEY, { + url: "https://example.com", + 
prompt: "What is this page about?", + }); + console.log("extract:", res.status, res.error); + expect(res.status).toBe("success"); + }); + + test("search", async () => { + const res = await search(API_KEY, { + query: "anthropic claude", + numResults: 2, + }); + console.log("search:", res.status, res.error); + expect(res.status).toBe("success"); + expect(res.data?.results.length).toBeGreaterThan(0); + }); + + test("history.list", async () => { + const res = await history.list(API_KEY, { limit: 5 }); + console.log("history.list:", res.status, res.data?.pagination); + expect(res.status).toBe("success"); + }); + + test("crawl.start and crawl.get", async () => { + const startRes = await crawl.start(API_KEY, { + url: "https://example.com", + maxPages: 2, + }); + console.log("crawl.start:", startRes.status, startRes.data?.id, startRes.error); + + if ( + startRes.status === "error" && + (startRes.error?.includes("Max") || startRes.error?.includes("Rate")) + ) { + console.log("Skipping - rate limited"); + return; + } + + expect(startRes.status).toBe("success"); + + if (startRes.data?.id) { + const getRes = await crawl.get(API_KEY, startRes.data.id); + console.log("crawl.get:", getRes.status, getRes.data?.status); + expect(getRes.status).toBe("success"); + } + }); +}); diff --git a/tests/scrapegraphai.test.ts b/tests/scrapegraphai.test.ts index 4186453..69a6695 100644 --- a/tests/scrapegraphai.test.ts +++ b/tests/scrapegraphai.test.ts @@ -1,13 +1,11 @@ -import { afterEach, describe, expect, mock, spyOn, test } from "bun:test"; +import { afterEach, describe, expect, spyOn, test } from "bun:test"; +import * as sdk from "../src/scrapegraphai.js"; -mock.module("../src/env.js", () => ({ - env: { debug: false, timeoutS: 120 }, -})); - -import * as scrapegraphai from "../src/scrapegraphai.js"; - -const API_KEY = "test-sgai-key-abc123"; -const BASE = "https://api.scrapegraphai.com/v1"; +const API_KEY = "test-sgai-key"; +const BASE = process.env.SGAI_API_URL || 
"https://api.scrapegraphai.com/v2";
+const HEALTH_BASE = process.env.SGAI_API_URL
+  ? process.env.SGAI_API_URL.replace(/\/v\d+$/, "")
+  : "https://api.scrapegraphai.com";
 
 function json(body: unknown, status = 200): Response {
   return new Response(JSON.stringify(body), {
@@ -22,50 +20,379 @@ afterEach(() => {
   fetchSpy?.mockRestore();
 });
 
-function expectPost(callIndex: number, path: string, body?: object) {
-  const [url, init] = fetchSpy.mock.calls[callIndex] as [string, RequestInit];
-  expect(url).toBe(`${BASE}${path}`);
-  expect(init.method).toBe("POST");
-  expect((init.headers as Record<string, string>)["SGAI-APIKEY"]).toBe(API_KEY);
-  expect((init.headers as Record<string, string>)["Content-Type"]).toBe("application/json");
-  if (body) expect(JSON.parse(init.body as string)).toEqual(body);
-}
-
-function expectGet(callIndex: number, path: string) {
+function expectRequest(
+  callIndex: number,
+  method: string,
+  path: string,
+  body?: object,
+  base = BASE,
+) {
   const [url, init] = fetchSpy.mock.calls[callIndex] as [string, RequestInit];
-  expect(url).toBe(`${BASE}${path}`);
-  expect(init.method).toBe("GET");
+  expect(url).toBe(`${base}${path}`);
+  expect(init.method).toBe(method);
   expect((init.headers as Record<string, string>)["SGAI-APIKEY"]).toBe(API_KEY);
+  if (body) {
+    expect((init.headers as Record<string, string>)["Content-Type"]).toBe("application/json");
+    expect(JSON.parse(init.body as string)).toEqual(body);
+  }
 }
 
-describe("smartScraper", () => {
-  const params = { user_prompt: "Extract prices", website_url: "https://example.com" };
+describe("scrape", () => {
+  const params = { url: "https://example.com" };
 
   test("success", async () => {
     const body = {
-      request_id: "abc-123",
-      status: "completed",
-      website_url: "https://example.com",
-      user_prompt: "Extract prices",
-      result: { prices: [10, 20] },
-      error: "",
+      results: { markdown: { data: ["# Hello"] } },
+      metadata: { contentType: "text/html" },
     };
     fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));
 
-    const res = await
scrapegraphai.smartScraper(API_KEY, params); + const res = await sdk.scrape(API_KEY, params); expect(res.status).toBe("success"); expect(res.data).toEqual(body); expect(res.elapsedMs).toBeGreaterThanOrEqual(0); - expect(fetchSpy).toHaveBeenCalledTimes(1); - expectPost(0, "/smartscraper", params); + expectRequest(0, "POST", "/scrape", params); + }); + + test("with fetchConfig - js mode and stealth", async () => { + const body = { + results: { markdown: { data: ["# Hello"] } }, + metadata: { contentType: "text/html", provider: "playwright" }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const paramsWithConfig = { + url: "https://example.com", + fetchConfig: { + mode: "js" as const, + stealth: true, + timeout: 45000, + wait: 2000, + scrolls: 3, + }, + formats: [{ type: "markdown" as const }], + }; + + const res = await sdk.scrape(API_KEY, paramsWithConfig); + + expect(res.status).toBe("success"); + expectRequest(0, "POST", "/scrape", paramsWithConfig); + }); + + test("with fetchConfig - headers and cookies", async () => { + const body = { + results: { html: { data: [""] } }, + metadata: { contentType: "text/html" }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const paramsWithConfig = { + url: "https://example.com", + fetchConfig: { + mode: "fast" as const, + headers: { "X-Custom-Header": "test-value", Authorization: "Bearer token123" }, + cookies: { session: "abc123", tracking: "xyz789" }, + }, + formats: [{ type: "html" as const }], + }; + + const res = await sdk.scrape(API_KEY, paramsWithConfig); + + expect(res.status).toBe("success"); + expectRequest(0, "POST", "/scrape", paramsWithConfig); + }); + + test("with fetchConfig - country geo targeting", async () => { + const body = { + results: { markdown: { data: ["# Localized content"] } }, + metadata: { contentType: "text/html" }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const paramsWithConfig = { 
+ url: "https://example.com", + fetchConfig: { country: "de" }, + formats: [{ type: "markdown" as const }], + }; + + const res = await sdk.scrape(API_KEY, paramsWithConfig); + + expect(res.status).toBe("success"); + expectRequest(0, "POST", "/scrape", paramsWithConfig); + }); + + test("multiple formats - markdown, html, links, images", async () => { + const body = { + results: { + markdown: { data: ["# Title"] }, + html: { data: ["
<h1>Title</h1>
"] }, + links: { data: ["https://example.com/page1"], metadata: { count: 1 } }, + images: { data: ["https://example.com/image.png"], metadata: { count: 1 } }, + }, + metadata: { contentType: "text/html" }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const multiFormatParams = { + url: "https://example.com", + formats: [ + { type: "markdown" as const, mode: "reader" as const }, + { type: "html" as const, mode: "prune" as const }, + { type: "links" as const }, + { type: "images" as const }, + ], + }; + + const res = await sdk.scrape(API_KEY, multiFormatParams); + + expect(res.status).toBe("success"); + expect(res.data?.results.markdown).toBeDefined(); + expect(res.data?.results.html).toBeDefined(); + expect(res.data?.results.links).toBeDefined(); + expect(res.data?.results.images).toBeDefined(); + expectRequest(0, "POST", "/scrape", multiFormatParams); + }); + + test("screenshot format with options", async () => { + const body = { + results: { + screenshot: { + data: { url: "https://storage.example.com/shot.png", width: 1920, height: 1080 }, + metadata: { contentType: "image/png" }, + }, + }, + metadata: { contentType: "text/html" }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const screenshotParams = { + url: "https://example.com", + formats: [ + { + type: "screenshot" as const, + fullPage: true, + width: 1920, + height: 1080, + quality: 95, + }, + ], + }; + + const res = await sdk.scrape(API_KEY, screenshotParams); + + expect(res.status).toBe("success"); + expect(res.data?.results.screenshot?.data.url).toBeDefined(); + expectRequest(0, "POST", "/scrape", screenshotParams); + }); + + test("json format with prompt and schema", async () => { + const body = { + results: { + json: { + data: { title: "Example", price: 99.99 }, + metadata: { chunker: { chunks: [{ size: 500 }] } }, + }, + }, + metadata: { contentType: "text/html" }, + }; + fetchSpy = spyOn(globalThis, 
"fetch").mockResolvedValueOnce(json(body)); + + const jsonParams = { + url: "https://example.com/product", + formats: [ + { + type: "json" as const, + prompt: "Extract product title and price", + schema: { + type: "object", + properties: { + title: { type: "string" }, + price: { type: "number" }, + }, + }, + }, + ], + }; + + const res = await sdk.scrape(API_KEY, jsonParams); + + expect(res.status).toBe("success"); + expect(res.data?.results.json?.data).toEqual({ title: "Example", price: 99.99 }); + expectRequest(0, "POST", "/scrape", jsonParams); + }); + + test("summary format", async () => { + const body = { + results: { + summary: { + data: "This is a summary of the page content.", + metadata: { chunker: { chunks: [{ size: 1000 }] } }, + }, + }, + metadata: { contentType: "text/html" }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const summaryParams = { + url: "https://example.com/article", + formats: [{ type: "summary" as const }], + }; + + const res = await sdk.scrape(API_KEY, summaryParams); + + expect(res.status).toBe("success"); + expect(res.data?.results.summary?.data).toBe("This is a summary of the page content."); + expectRequest(0, "POST", "/scrape", summaryParams); + }); + + test("branding format", async () => { + const body = { + results: { + branding: { + data: { + colorScheme: "light", + colors: { + primary: "#0066cc", + accent: "#ff6600", + background: "#ffffff", + textPrimary: "#333333", + link: "#0066cc", + }, + typography: { + primary: { family: "Inter", fallback: "sans-serif" }, + heading: { family: "Inter", fallback: "sans-serif" }, + mono: { family: "Fira Code", fallback: "monospace" }, + sizes: { h1: "2.5rem", h2: "2rem", body: "1rem" }, + }, + images: { logo: "", favicon: "", ogImage: "" }, + spacing: { baseUnit: 8, borderRadius: "4px" }, + frameworkHints: ["react"], + personality: { tone: "professional", energy: "medium", targetAudience: "developers" }, + confidence: 0.85, + }, + metadata: { + 
branding: { + title: "Example", + description: "Example site", + favicon: "", + language: "en", + themeColor: "#0066cc", + ogTitle: "Example", + ogDescription: "Example site", + ogImage: "", + ogUrl: "https://example.com", + }, + }, + }, + }, + metadata: { contentType: "text/html" }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const brandingParams = { + url: "https://example.com", + formats: [{ type: "branding" as const }], + }; + + const res = await sdk.scrape(API_KEY, brandingParams); + + expect(res.status).toBe("success"); + expect(res.data?.results.branding?.data.colorScheme).toBe("light"); + expectRequest(0, "POST", "/scrape", brandingParams); + }); + + test("PDF document scraping", async () => { + const body = { + results: { + markdown: { data: ["# PDF Document\n\nThis is the content extracted from the PDF."] }, + }, + metadata: { + contentType: "application/pdf", + ocr: { + model: "gpt-4o", + pagesProcessed: 2, + pages: [ + { + index: 0, + images: [], + tables: [], + hyperlinks: [], + dimensions: { dpi: 72, height: 792, width: 612 }, + }, + { + index: 1, + images: [], + tables: [], + hyperlinks: [], + dimensions: { dpi: 72, height: 792, width: 612 }, + }, + ], + }, + }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const pdfParams = { + url: "https://pdfobject.com/pdf/sample.pdf", + contentType: "application/pdf" as const, + formats: [{ type: "markdown" as const }], + }; + + const res = await sdk.scrape(API_KEY, pdfParams); + + expect(res.status).toBe("success"); + expect(res.data?.metadata.contentType).toBe("application/pdf"); + expect(res.data?.metadata.ocr?.pagesProcessed).toBe(2); + expectRequest(0, "POST", "/scrape", pdfParams); + }); + + test("DOCX document scraping", async () => { + const body = { + results: { markdown: { data: ["# Word Document\n\nContent from DOCX file."] } }, + metadata: { + contentType: 
"application/vnd.openxmlformats-officedocument.wordprocessingml.document", + }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const docxParams = { + url: "https://example.com/document.docx", + contentType: + "application/vnd.openxmlformats-officedocument.wordprocessingml.document" as const, + formats: [{ type: "markdown" as const }], + }; + + const res = await sdk.scrape(API_KEY, docxParams); + + expect(res.status).toBe("success"); + expectRequest(0, "POST", "/scrape", docxParams); + }); + + test("image scraping with OCR", async () => { + const body = { + results: { markdown: { data: ["Text extracted from image via OCR"] } }, + metadata: { contentType: "image/png" }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const imageParams = { + url: "https://example.com/screenshot.png", + contentType: "image/png" as const, + formats: [{ type: "markdown" as const }], + }; + + const res = await sdk.scrape(API_KEY, imageParams); + + expect(res.status).toBe("success"); + expectRequest(0, "POST", "/scrape", imageParams); }); test("HTTP 401", async () => { fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce( json({ detail: "Invalid key" }, 401), ); - const res = await scrapegraphai.smartScraper(API_KEY, params); + const res = await sdk.scrape(API_KEY, params); expect(res.status).toBe("error"); expect(res.error).toContain("Invalid or missing API key"); @@ -73,7 +400,7 @@ describe("smartScraper", () => { test("HTTP 402", async () => { fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json({}, 402)); - const res = await scrapegraphai.smartScraper(API_KEY, params); + const res = await sdk.scrape(API_KEY, params); expect(res.status).toBe("error"); expect(res.error).toContain("Insufficient credits"); @@ -81,7 +408,7 @@ describe("smartScraper", () => { test("HTTP 422", async () => { fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json({}, 422)); - const res = await 
scrapegraphai.smartScraper(API_KEY, params); + const res = await sdk.scrape(API_KEY, params); expect(res.status).toBe("error"); expect(res.error).toContain("Invalid parameters"); @@ -89,7 +416,7 @@ describe("smartScraper", () => { test("HTTP 429", async () => { fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json({}, 429)); - const res = await scrapegraphai.smartScraper(API_KEY, params); + const res = await sdk.scrape(API_KEY, params); expect(res.status).toBe("error"); expect(res.error).toContain("Rate limited"); @@ -97,27 +424,17 @@ describe("smartScraper", () => { test("HTTP 500", async () => { fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json({}, 500)); - const res = await scrapegraphai.smartScraper(API_KEY, params); + const res = await sdk.scrape(API_KEY, params); expect(res.status).toBe("error"); expect(res.error).toContain("Server error"); }); - test("HTTP error with detail", async () => { - fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce( - json({ detail: "quota exceeded" }, 402), - ); - const res = await scrapegraphai.smartScraper(API_KEY, params); - - expect(res.status).toBe("error"); - expect(res.error).toContain("quota exceeded"); - }); - test("timeout", async () => { fetchSpy = spyOn(globalThis, "fetch").mockRejectedValueOnce( new DOMException("The operation was aborted", "TimeoutError"), ); - const res = await scrapegraphai.smartScraper(API_KEY, params); + const res = await sdk.scrape(API_KEY, params); expect(res.status).toBe("error"); expect(res.error).toBe("Request timed out"); @@ -125,211 +442,673 @@ describe("smartScraper", () => { test("network error", async () => { fetchSpy = spyOn(globalThis, "fetch").mockRejectedValueOnce(new Error("fetch failed")); - const res = await scrapegraphai.smartScraper(API_KEY, params); + const res = await sdk.scrape(API_KEY, params); expect(res.status).toBe("error"); expect(res.error).toBe("fetch failed"); }); }); -describe("searchScraper", () => { - const params = { 
user_prompt: "Best pizza in NYC" }; +describe("extract", () => { + const params = { url: "https://example.com", prompt: "Extract prices" }; test("success", async () => { const body = { - request_id: "abc-123", - status: "completed", - user_prompt: "Best pizza in NYC", - num_results: 3, - result: { answer: "Joe's Pizza" }, - markdown_content: null, - reference_urls: ["https://example.com"], - error: null, + raw: null, + json: { prices: [10, 20] }, + usage: { promptTokens: 100, completionTokens: 50 }, + metadata: { chunker: { chunks: [{ size: 1000 }] } }, }; fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); - const res = await scrapegraphai.searchScraper(API_KEY, params); + const res = await sdk.extract(API_KEY, params); expect(res.status).toBe("success"); expect(res.data).toEqual(body); - expectPost(0, "/searchscraper", params); + expectRequest(0, "POST", "/extract", params); + }); + + test("with HTML input instead of URL", async () => { + const body = { + raw: null, + json: { title: "Test Page" }, + usage: { promptTokens: 50, completionTokens: 20 }, + metadata: { chunker: { chunks: [{ size: 200 }] } }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const htmlParams = { + html: "Test Page
</title></head><body><p>Hello</p></body></html>
", + prompt: "Extract the page title", + }; + + const res = await sdk.extract(API_KEY, htmlParams); + + expect(res.status).toBe("success"); + expectRequest(0, "POST", "/extract", htmlParams); + }); + + test("with markdown input instead of URL", async () => { + const body = { + raw: null, + json: { headings: ["Introduction", "Methods"] }, + usage: { promptTokens: 30, completionTokens: 15 }, + metadata: { chunker: { chunks: [{ size: 100 }] } }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const mdParams = { + markdown: "# Introduction\n\nSome content.\n\n# Methods\n\nMore content.", + prompt: "Extract all headings", + }; + + const res = await sdk.extract(API_KEY, mdParams); + + expect(res.status).toBe("success"); + expectRequest(0, "POST", "/extract", mdParams); + }); + + test("with schema for structured output", async () => { + const body = { + raw: null, + json: { products: [{ name: "Widget", price: 29.99, inStock: true }] }, + usage: { promptTokens: 150, completionTokens: 80 }, + metadata: { chunker: { chunks: [{ size: 500 }] } }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const schemaParams = { + url: "https://example.com/products", + prompt: "Extract all products with their names, prices, and availability", + schema: { + type: "object", + properties: { + products: { + type: "array", + items: { + type: "object", + properties: { + name: { type: "string" }, + price: { type: "number" }, + inStock: { type: "boolean" }, + }, + }, + }, + }, + }, + }; + + const res = await sdk.extract(API_KEY, schemaParams); + + expect(res.status).toBe("success"); + expect(res.data?.json?.products).toHaveLength(1); + expectRequest(0, "POST", "/extract", schemaParams); + }); + + test("with fetchConfig and contentType for PDF", async () => { + const body = { + raw: "Raw text from PDF", + json: { sections: ["Abstract", "Introduction", "Conclusion"] }, + usage: { promptTokens: 200, completionTokens: 50 }, + 
metadata: { chunker: { chunks: [{ size: 2000 }] }, fetch: { provider: "playwright" } }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const pdfParams = { + url: "https://pdfobject.com/pdf/sample.pdf", + contentType: "application/pdf" as const, + prompt: "List all section headings in this document", + fetchConfig: { timeout: 60000 }, + }; + + const res = await sdk.extract(API_KEY, pdfParams); + + expect(res.status).toBe("success"); + expect(res.data?.raw).toBe("Raw text from PDF"); + expectRequest(0, "POST", "/extract", pdfParams); + }); + + test("with html mode options", async () => { + const body = { + raw: null, + json: { mainContent: "Article text without boilerplate" }, + usage: { promptTokens: 100, completionTokens: 30 }, + metadata: { chunker: { chunks: [{ size: 800 }] } }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const modeParams = { + url: "https://example.com/article", + prompt: "Extract the main article content", + mode: "reader" as const, + }; + + const res = await sdk.extract(API_KEY, modeParams); + + expect(res.status).toBe("success"); + expectRequest(0, "POST", "/extract", modeParams); }); }); -describe("markdownify", () => { - const params = { website_url: "https://example.com" }; +describe("search", () => { + const params = { query: "best pizza NYC" }; test("success", async () => { const body = { - request_id: "abc-123", - status: "completed", - website_url: "https://example.com", - result: "# Hello", - error: "", + results: [{ url: "https://example.com", title: "Pizza", content: "Great pizza" }], + metadata: { search: {}, pages: { requested: 3, scraped: 3 } }, }; fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); - const res = await scrapegraphai.markdownify(API_KEY, params); + const res = await sdk.search(API_KEY, params); expect(res.status).toBe("success"); expect(res.data).toEqual(body); - expectPost(0, "/markdownify", params); + 
expectRequest(0, "POST", "/search", params); + }); + + test("with numResults and format options", async () => { + const body = { + results: [ + { url: "https://example1.com", title: "Result 1", content: "
<p>HTML content 1</p>
" }, + { url: "https://example2.com", title: "Result 2", content: "
<p>HTML content 2</p>
" }, + { url: "https://example3.com", title: "Result 3", content: "
<p>HTML content 3</p>
" }, + { url: "https://example4.com", title: "Result 4", content: "
<p>HTML content 4</p>
" }, + { url: "https://example5.com", title: "Result 5", content: "
<p>HTML content 5</p>
" }, + ], + metadata: { search: { provider: "google" }, pages: { requested: 5, scraped: 5 } }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const searchParams = { + query: "typescript best practices", + numResults: 5, + format: "html" as const, + }; + + const res = await sdk.search(API_KEY, searchParams); + + expect(res.status).toBe("success"); + expect(res.data?.results).toHaveLength(5); + expectRequest(0, "POST", "/search", searchParams); + }); + + test("with prompt and schema for structured extraction", async () => { + const body = { + results: [{ url: "https://example.com", title: "Product", content: "Widget $29.99" }], + json: { products: [{ name: "Widget", price: 29.99 }] }, + usage: { promptTokens: 100, completionTokens: 30 }, + metadata: { + search: {}, + pages: { requested: 3, scraped: 3 }, + chunker: { chunks: [{ size: 500 }] }, + }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const searchParams = { + query: "buy widgets online", + prompt: "Extract product names and prices from search results", + schema: { + type: "object", + properties: { + products: { + type: "array", + items: { + type: "object", + properties: { + name: { type: "string" }, + price: { type: "number" }, + }, + }, + }, + }, + }, + }; + + const res = await sdk.search(API_KEY, searchParams); + + expect(res.status).toBe("success"); + expect(res.data?.json).toBeDefined(); + expectRequest(0, "POST", "/search", searchParams); + }); + + test("with location and time range filters", async () => { + const body = { + results: [ + { url: "https://news.example.com", title: "Breaking News", content: "Recent event" }, + ], + metadata: { search: {}, pages: { requested: 3, scraped: 3 } }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const searchParams = { + query: "local news", + locationGeoCode: "us", + timeRange: "past_24_hours" as const, + }; + + const res = await sdk.search(API_KEY, 
searchParams); + + expect(res.status).toBe("success"); + expectRequest(0, "POST", "/search", searchParams); + }); + + test("with fetchConfig and html mode", async () => { + const body = { + results: [{ url: "https://example.com", title: "Test", content: "# Clean content" }], + metadata: { search: {}, pages: { requested: 2, scraped: 2 } }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const searchParams = { + query: "test query", + numResults: 2, + mode: "prune" as const, + fetchConfig: { mode: "js" as const, timeout: 45000 }, + }; + + const res = await sdk.search(API_KEY, searchParams); + + expect(res.status).toBe("success"); + expectRequest(0, "POST", "/search", searchParams); }); }); -describe("scrape", () => { - const params = { website_url: "https://example.com" }; +describe("getCredits", () => { + test("success", async () => { + const body = { + remaining: 1000, + used: 500, + plan: "pro", + jobs: { crawl: { used: 1, limit: 5 }, monitor: { used: 2, limit: 10 } }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const res = await sdk.getCredits(API_KEY); + + expect(res.status).toBe("success"); + expect(res.data).toEqual(body); + expectRequest(0, "GET", "/credits"); + }); +}); +describe("checkHealth", () => { test("success", async () => { + const body = { status: "ok", uptime: 12345 }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const res = await sdk.checkHealth(API_KEY); + + expect(res.status).toBe("success"); + expect(res.data).toEqual(body); + expectRequest(0, "GET", "/healthz", undefined, HEALTH_BASE); + }); +}); + +describe("history", () => { + test("list success without params", async () => { const body = { - scrape_request_id: "abc-123", + data: [], + pagination: { page: 1, limit: 20, total: 0 }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const res = await sdk.history.list(API_KEY); + + 
expect(res.status).toBe("success"); + expect(res.data).toEqual(body); + expectRequest(0, "GET", "/history"); + }); + + test("list success with params", async () => { + const body = { + data: [], + pagination: { page: 2, limit: 10, total: 50 }, + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const res = await sdk.history.list(API_KEY, { page: 2, limit: 10, service: "scrape" }); + + expect(res.status).toBe("success"); + const [url] = fetchSpy.mock.calls[0] as [string, RequestInit]; + expect(url).toContain("page=2"); + expect(url).toContain("limit=10"); + expect(url).toContain("service=scrape"); + }); + + test("get success", async () => { + const body = { + id: "abc-123", + service: "scrape", status: "completed", - html: "...", - branding: null, - metadata: null, - error: "", + params: { url: "https://example.com" }, + result: {}, }; fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); - const res = await scrapegraphai.scrape(API_KEY, params); + const res = await sdk.history.get(API_KEY, "abc-123"); expect(res.status).toBe("success"); expect(res.data).toEqual(body); - expectPost(0, "/scrape", params); + expectRequest(0, "GET", "/history/abc-123"); }); }); describe("crawl", () => { - const params = { url: "https://example.com", prompt: "Extract main content" }; + const params = { url: "https://example.com" }; + + test("start success", async () => { + const body = { + id: "crawl-123", + status: "running", + total: 50, + finished: 0, + pages: [], + }; + fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body)); + + const res = await sdk.crawl.start(API_KEY, params); - test("immediate completion", async () => { - const body = { status: "done", pages: [{ url: "https://example.com", content: "data" }] }; + expect(res.status).toBe("success"); + expect(res.data).toEqual(body); + expectRequest(0, "POST", "/crawl", params); + }); + + test("start with full config - formats and limits", async () => { + const 
body = {
+      id: "crawl-456",
+      status: "running",
+      total: 100,
+      finished: 0,
+      pages: [],
+    };
     fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));

-    const res = await scrapegraphai.crawl(API_KEY, params);
+    const fullParams = {
+      url: "https://example.com",
+      formats: [
+        { type: "markdown" as const, mode: "reader" as const },
+        { type: "screenshot" as const, fullPage: false, width: 1280, height: 720, quality: 80 },
+      ],
+      maxDepth: 3,
+      maxPages: 100,
+      maxLinksPerPage: 20,
+    };
+
+    const res = await sdk.crawl.start(API_KEY, fullParams);

     expect(res.status).toBe("success");
-    expect(res.data as any).toEqual(body);
-    expectPost(0, "/crawl");
+    expectRequest(0, "POST", "/crawl", fullParams);
   });

-  test("polls with task_id", async () => {
-    fetchSpy = spyOn(globalThis, "fetch")
-      .mockResolvedValueOnce(json({ status: "pending", task_id: "crawl-99" }))
-      .mockResolvedValueOnce(json({ status: "done", task_id: "crawl-99", pages: [] }));
+  test("start with include/exclude patterns", async () => {
+    const body = {
+      id: "crawl-789",
+      status: "running",
+      total: 30,
+      finished: 0,
+      pages: [],
+    };
+    fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));
+
+    const patternParams = {
+      url: "https://example.com",
+      includePatterns: ["/blog/*", "/docs/*"],
+      excludePatterns: ["/admin/*", "*.pdf"],
+      allowExternal: false,
+    };

-    const res = await scrapegraphai.crawl(API_KEY, params);
+    const res = await sdk.crawl.start(API_KEY, patternParams);

     expect(res.status).toBe("success");
-    expect(fetchSpy).toHaveBeenCalledTimes(2);
-    expectGet(1, "/crawl/crawl-99");
+    expectRequest(0, "POST", "/crawl", patternParams);
   });

-  test("calls onPoll callback", async () => {
-    const statuses: string[] = [];
-    fetchSpy = spyOn(globalThis, "fetch")
-      .mockResolvedValueOnce(json({ status: "pending", task_id: "crawl-99" }))
-      .mockResolvedValueOnce(json({ status: "done", task_id: "crawl-99", pages: [] }));
+  test("start with fetchConfig and contentTypes", async () => {
+    const body = {
+      id: "crawl-abc",
+      status: "running",
+      total: 50,
+      finished: 0,
+      pages: [],
+    };
+    fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));

-    await scrapegraphai.crawl(API_KEY, params, (s) => statuses.push(s));
+    const configParams = {
+      url: "https://example.com",
+      contentTypes: ["text/html" as const, "application/pdf" as const],
+      fetchConfig: {
+        mode: "js" as const,
+        stealth: true,
+        timeout: 45000,
+        wait: 1000,
+      },
+    };

-    expect(statuses).toEqual(["done"]);
+    const res = await sdk.crawl.start(API_KEY, configParams);
+
+    expect(res.status).toBe("success");
+    expectRequest(0, "POST", "/crawl", configParams);
   });

-  test("poll failure", async () => {
-    fetchSpy = spyOn(globalThis, "fetch")
-      .mockResolvedValueOnce(json({ status: "pending", task_id: "crawl-99" }))
-      .mockResolvedValueOnce(json({ status: "failed", error: "Crawl exploded" }));
+  test("get success", async () => {
+    const body = {
+      id: "crawl-123",
+      status: "completed",
+      total: 10,
+      finished: 10,
+      pages: [{ url: "https://example.com", status: "completed" }],
+    };
+    fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));

-    const res = await scrapegraphai.crawl(API_KEY, params);
+    const res = await sdk.crawl.get(API_KEY, "crawl-123");

-    expect(res.status).toBe("error");
-    expect(res.error).toBe("Crawl exploded");
+    expect(res.status).toBe("success");
+    expect(res.data).toEqual(body);
+    expectRequest(0, "GET", "/crawl/crawl-123");
+  });
+
+  test("stop success", async () => {
+    fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json({ ok: true }));
+
+    const res = await sdk.crawl.stop(API_KEY, "crawl-123");
+
+    expect(res.status).toBe("success");
+    expect(res.data).toEqual({ ok: true });
+    expectRequest(0, "POST", "/crawl/crawl-123/stop");
+  });
+
+  test("resume success", async () => {
+    fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json({ ok: true }));
+
+    const res = await sdk.crawl.resume(API_KEY, "crawl-123");
+
+    expect(res.status).toBe("success");
+    expect(res.data).toEqual({ ok: true });
+    expectRequest(0, "POST", "/crawl/crawl-123/resume");
+  });
+
+  test("delete success", async () => {
+    fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json({ ok: true }));
+
+    const res = await sdk.crawl.delete(API_KEY, "crawl-123");
+
+    expect(res.status).toBe("success");
+    expect(res.data).toEqual({ ok: true });
+    expectRequest(0, "DELETE", "/crawl/crawl-123");
   });
 });

-describe("agenticScraper", () => {
-  const params = { url: "https://example.com", steps: ["Click login"] };
+describe("monitor", () => {
+  const createParams = { url: "https://example.com", interval: "0 * * * *" };

-  test("success", async () => {
+  test("create success", async () => {
     const body = {
-      request_id: "abc-123",
-      status: "completed",
-      result: { screenshot: "base64..." },
-      error: "",
+      cronId: "mon-123",
+      scheduleId: "sched-456",
+      interval: "0 * * * *",
+      status: "active",
+      config: createParams,
+      createdAt: "2024-01-01T00:00:00Z",
+      updatedAt: "2024-01-01T00:00:00Z",
     };
     fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));

-    const res = await scrapegraphai.agenticScraper(API_KEY, params);
+    const res = await sdk.monitor.create(API_KEY, createParams);

     expect(res.status).toBe("success");
     expect(res.data).toEqual(body);
-    expectPost(0, "/agentic-scrapper", params);
+    expectRequest(0, "POST", "/monitor", createParams);
   });
-});

-describe("generateSchema", () => {
-  const params = { user_prompt: "Schema for product" };
+  test("create with multiple formats and webhook", async () => {
+    const fullParams = {
+      url: "https://example.com/prices",
+      name: "Price Monitor",
+      interval: "0 */6 * * *",
+      formats: [
+        { type: "markdown" as const, mode: "reader" as const },
+        { type: "json" as const, prompt: "Extract all product prices", mode: "normal" as const },
+        { type: "screenshot" as const, fullPage: true, width: 1440, height: 900, quality: 90 },
+      ],
+      webhookUrl: "https://hooks.example.com/notify",
+    };
+    const body = {
+      cronId: "mon-456",
+      scheduleId: "sched-789",
+      interval: "0 */6 * * *",
+      status: "active",
+      config: fullParams,
+      createdAt: "2024-01-01T00:00:00Z",
+      updatedAt: "2024-01-01T00:00:00Z",
+    };
+    fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));

-  test("success", async () => {
+    const res = await sdk.monitor.create(API_KEY, fullParams);
+
+    expect(res.status).toBe("success");
+    expectRequest(0, "POST", "/monitor", fullParams);
+  });
+
+  test("create with fetchConfig", async () => {
+    const configParams = {
+      url: "https://spa-example.com",
+      interval: "0 0 * * *",
+      fetchConfig: {
+        mode: "js" as const,
+        stealth: true,
+        wait: 3000,
+        scrolls: 5,
+      },
+    };
     const body = {
-      request_id: "abc-123",
-      status: "completed",
-      user_prompt: "Schema for product",
-      generated_schema: { type: "object" },
+      cronId: "mon-789",
+      scheduleId: "sched-abc",
+      interval: "0 0 * * *",
+      status: "active",
+      config: configParams,
+      createdAt: "2024-01-01T00:00:00Z",
+      updatedAt: "2024-01-01T00:00:00Z",
     };
     fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));

-    const res = await scrapegraphai.generateSchema(API_KEY, params);
+    const res = await sdk.monitor.create(API_KEY, configParams);
+
+    expect(res.status).toBe("success");
+    expectRequest(0, "POST", "/monitor", configParams);
+  });
+
+  test("list success", async () => {
+    const body = [
+      {
+        cronId: "mon-123",
+        scheduleId: "sched-456",
+        interval: "0 * * * *",
+        status: "active",
+      },
+    ];
+    fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));
+
+    const res = await sdk.monitor.list(API_KEY);

     expect(res.status).toBe("success");
     expect(res.data).toEqual(body);
-    expectPost(0, "/generate_schema", params);
+    expectRequest(0, "GET", "/monitor");
   });
-});

-describe("sitemap", () => {
-  const params = { website_url: "https://example.com" };
+  test("get success", async () => {
+    const body = {
+      cronId: "mon-123",
+      scheduleId: "sched-456",
+      interval: "0 * * * *",
+      status: "active",
+    };
+    fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));

-  test("success", async () => {
+    const res = await sdk.monitor.get(API_KEY, "mon-123");
+
+    expect(res.status).toBe("success");
+    expect(res.data).toEqual(body);
+    expectRequest(0, "GET", "/monitor/mon-123");
+  });
+
+  test("update success", async () => {
+    const updateParams = { interval: "0 0 * * *" };
     const body = {
-      request_id: "abc-123",
-      urls: ["https://example.com/a", "https://example.com/b"],
+      cronId: "mon-123",
+      scheduleId: "sched-456",
+      interval: "0 0 * * *",
+      status: "active",
     };
     fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));

-    const res = await scrapegraphai.sitemap(API_KEY, params);
+    const res = await sdk.monitor.update(API_KEY, "mon-123", updateParams);

     expect(res.status).toBe("success");
     expect(res.data).toEqual(body);
-    expectPost(0, "/sitemap", params);
+    expectRequest(0, "PATCH", "/monitor/mon-123", updateParams);
   });
-});

-describe("getCredits", () => {
-  test("success", async () => {
-    const body = { remaining_credits: 420, total_credits_used: 69 };
+  test("delete success", async () => {
+    fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json({ ok: true }));
+
+    const res = await sdk.monitor.delete(API_KEY, "mon-123");
+
+    expect(res.status).toBe("success");
+    expect(res.data).toEqual({ ok: true });
+    expectRequest(0, "DELETE", "/monitor/mon-123");
+  });
+
+  test("pause success", async () => {
+    const body = {
+      cronId: "mon-123",
+      scheduleId: "sched-456",
+      interval: "0 * * * *",
+      status: "paused",
+    };
     fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));

-    const res = await scrapegraphai.getCredits(API_KEY);
+    const res = await sdk.monitor.pause(API_KEY, "mon-123");

     expect(res.status).toBe("success");
     expect(res.data).toEqual(body);
-    expectGet(0, "/credits");
+    expectRequest(0, "POST", "/monitor/mon-123/pause");
   });
-});

-describe("checkHealth", () => {
-  test("success", async () => {
-    const body = { status: "healthy" };
+  test("resume success", async () => {
+    const body = {
+      cronId: "mon-123",
+      scheduleId: "sched-456",
+      interval: "0 * * * *",
+      status: "active",
+    };
     fetchSpy = spyOn(globalThis, "fetch").mockResolvedValueOnce(json(body));

-    const res = await scrapegraphai.checkHealth(API_KEY);
+    const res = await sdk.monitor.resume(API_KEY, "mon-123");

     expect(res.status).toBe("success");
     expect(res.data).toEqual(body);
-    const [url, init] = fetchSpy.mock.calls[0] as [string, RequestInit];
-    expect(url).toBe("https://api.scrapegraphai.com/healthz");
-    expect(init.method).toBe("GET");
+    expectRequest(0, "POST", "/monitor/mon-123/resume");
   });
 });

diff --git a/tsconfig.json b/tsconfig.json
index ab488c9..234c173 100644
--- a/tsconfig.json
+++ b/tsconfig.json
@@ -3,6 +3,7 @@
     "target": "ES2024",
     "module": "nodenext",
     "moduleResolution": "nodenext",
+    "types": ["node"],
     "strict": true,
     "declaration": true,
     "declarationMap": true,
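The tests above lean on two shared helpers, `json()` and `expectRequest()`, which are defined outside this hunk. A minimal sketch of what they might look like is below; the helper names match the diff, but the base URL constant and the explicit `calls` parameter are assumptions (the real `expectRequest` presumably closes over `fetchSpy.mock.calls` instead of taking them as an argument).

```typescript
const BASE_URL = "https://api.scrapegraphai.com"; // assumed API base URL

// Shape of one recorded fetch invocation: [url, init].
type RecordedCall = [url: string, init?: RequestInit];

// Wrap a payload in a JSON Response, as `mockResolvedValueOnce(json(body))` expects.
function json(body: unknown, status = 200): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "Content-Type": "application/json" },
  });
}

// Assert that the nth recorded fetch call hit `path` with the given method
// and, when provided, the given JSON request body.
function expectRequest(
  calls: RecordedCall[],
  index: number,
  method: string,
  path: string,
  body?: unknown,
): void {
  const [url, init] = calls[index];
  if (url !== `${BASE_URL}${path}`) {
    throw new Error(`expected ${BASE_URL}${path}, got ${url}`);
  }
  const actualMethod = init?.method ?? "GET";
  if (actualMethod !== method) {
    throw new Error(`expected method ${method}, got ${actualMethod}`);
  }
  if (body !== undefined) {
    const sent = JSON.parse(String(init?.body));
    if (JSON.stringify(sent) !== JSON.stringify(body)) {
      throw new Error("request body did not match expected params");
    }
  }
}
```

Centralizing the method/path/body assertion in one helper is what lets every test collapse its verification to a single `expectRequest(...)` line, regardless of HTTP verb.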