ScrAPI SDK for Python

License: MIT PyPI

ScrAPI is your ultimate web scraping solution, offering powerful, reliable, and easy-to-use features to extract data from any website effortlessly.

Official Python SDK for the ScrAPI web scraping service.

Installation

pip install scrapi-sdk

Install optional HTML helpers:

pip install "scrapi-sdk[html]"

Quick start (sync)

from scrapi_sdk import ScrapeRequest, ScrapiClient

with ScrapiClient("YOUR_API_KEY") as client:
    response = client.scrape(ScrapeRequest("https://deventerprise.com"))
    print(response.content if response else "No response")

Quick start (async)

import asyncio
from scrapi_sdk import AsyncScrapiClient


async def main() -> None:
    async with AsyncScrapiClient("YOUR_API_KEY") as client:
        response = await client.scrape("https://deventerprise.com")
        print(response.content if response else "No response")


asyncio.run(main())

Scrape request options

All options map to ScrAPI API fields while exposing Pythonic snake_case names.

Python field          Type                Description
--------------------  ------------------  -----------
url                   str                 URL to scrape. Relative inputs are normalized to https://....
response_format       ResponseFormat      Must be ResponseFormat.JSON when using this SDK client.
response_selector     str | None          CSS/XPath selector for response filtering.
cookies               dict[str, str]      Cookies sent to target request.
headers               dict[str, str]      Headers sent to target request.
request_method        str                 HTTP method override (default GET).
request_body_base64   str | None          Base64 request payload.
proxy_type            ProxyType           NONE, FREE, RESIDENTIAL, DATACENTER, TOR, CUSTOM.
proxy_country         str | None          Three-letter country code, e.g. USA.
proxy_city            str | None          City key (requires proxy_country).
custom_proxy_url      str | None          Custom proxy URL.
use_browser           bool                Enable browser mode.
solve_captchas        bool                Auto solve captchas (browser mode only).
include_screenshot    bool                Include screenshot URL in response (browser mode only).
include_pdf           bool                Include PDF URL in response (browser mode only).
include_video         bool                Include video URL in response (browser mode only).
accept_dialogs        bool                Accept browser dialogs/popups.
session_id            str | None          Reuse session context across calls.
callback_url          str | None          Webhook URL called when scrape completes.
browser_commands      BrowserCommandList  Ordered browser action commands.

Example:

from scrapi_sdk import ProxyType, ResponseFormat, ScrapeRequest

request = ScrapeRequest("https://deventerprise.com")
request.proxy_type = ProxyType.RESIDENTIAL
request.proxy_country = "USA"
request.use_browser = True
request.solve_captchas = True
request.include_screenshot = True
request.response_format = ResponseFormat.JSON
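
The request_body_base64 field expects a payload that is already Base64-encoded. A minimal sketch of preparing a JSON body with the standard library might look like the following. The helper name encode_body and the sample payload are illustrative and not part of the SDK, and it assumes the API wants standard (not URL-safe) Base64 of the UTF-8 bytes.

```python
import base64
import json


def encode_body(payload: dict) -> str:
    """Serialize payload to JSON and return it as a Base64 ASCII string.

    Hypothetical helper, not part of scrapi_sdk.
    """
    raw = json.dumps(payload).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


body = encode_body({"query": "laptops", "page": 1})
print(body)

# Decoding round-trips to the original payload.
assert json.loads(base64.b64decode(body)) == {"query": "laptops", "page": 1}
```

The resulting string could then be assigned to request.request_body_base64, typically together with request.request_method = "POST".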

Browser commands

When use_browser=True, chain browser commands with BrowserCommandList:

from scrapi_sdk import ScrapeRequest

request = ScrapeRequest("https://www.roboform.com/filling-test-all-fields")
request.use_browser = True
request.accept_dialogs = True

(
    request.browser_commands
    .input("input[name='01___title']", "Mr")
    .input("input[name='02frstname']", "Werner")
    .input("input[name='04lastname']", "van Deventer")
    .select("select[name='40cc__type']", "Discover")
    .wait(3000)
    .wait_for("input[type='reset']")
    .click("input[type='reset']")
    .wait(1000)
    .scroll(1000)
    .evaluate("console.log('any valid code...')")
)

Scrape response data

ScrapeResponse includes all API response details.

response = client.scrape("https://deventerprise.com")

if response:
    print(response.request_url)
    print(response.response_url)
    print(response.duration)
    print(response.attempts)
    print(response.credits_used)
    print(response.status_code)
    print(response.screenshot_url)
    print(response.pdf_url)
    print(response.video_url)
    print(response.content)
    print(response.content_hash)  # SHA1 of the UTF-16LE encoded content, for parity with the .NET SDK.

    for captcha_name, solved_count in response.captchas_solved.items():
        print(f"{captcha_name}: {solved_count}")

    for key, value in response.headers.items():
        print(f"{key}: {value}")

    for key, value in response.cookies.items():
        print(f"{key}: {value}")

    for message in response.error_messages or []:
        print(message)
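
The comment on content_hash describes a SHA1 over the UTF-16LE bytes of the content. As a rough illustration of what that means, the sketch below computes such a digest with hashlib. The function name content_hash_sketch is hypothetical, and the lowercase hex output format is an assumption; the SDK may format the value differently.

```python
import hashlib


def content_hash_sketch(content: str) -> str:
    """Return the SHA-1 hex digest of content encoded as UTF-16LE.

    Illustrative only; the SDK computes content_hash itself.
    """
    return hashlib.sha1(content.encode("utf-16-le")).hexdigest()


# UTF-16LE uses two bytes per BMP character and no byte-order mark,
# so the digest differs from a UTF-8 based hash of the same text.
text = "<html></html>"
print(content_hash_sketch(text))
print(hashlib.sha1(text.encode("utf-8")).hexdigest())
```

The encoding matters when comparing hashes across tools: a hash computed over UTF-8 bytes will not match one computed over UTF-16LE bytes for the same string.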

If beautifulsoup4 is installed, response.html returns a parsed BeautifulSoup object.

Scrape request defaults

ScrapeRequestDefaults applies defaults to every new ScrapeRequest.

from scrapi_sdk import ProxyType, ScrapeRequest, ScrapeRequestDefaults

ScrapeRequestDefaults.proxy_type = ProxyType.RESIDENTIAL
ScrapeRequestDefaults.use_browser = True
ScrapeRequestDefaults.solve_captchas = True
ScrapeRequestDefaults.headers["Sample"] = "Custom-Value"

request = ScrapeRequest("https://deventerprise.com")
request.proxy_type = ProxyType.TOR  # explicit override

assert request.proxy_type == ProxyType.TOR
assert request.use_browser is True
assert request.solve_captchas is True
assert request.headers["Sample"] == "Custom-Value"
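
One detail in the example above is that ScrapeRequestDefaults.headers is a mutable dictionary held at class level. The sketch below shows one way such defaults could be snapshotted into each new request so that per-request changes do not leak back into the defaults. Defaults and Request are hypothetical stand-ins, not the SDK's classes, and the SDK's internals may differ.

```python
from __future__ import annotations


class Defaults:
    """Hypothetical stand-in for a class holding process-wide defaults."""

    use_browser: bool = False
    headers: dict[str, str] = {}


class Request:
    """Hypothetical stand-in for a request that snapshots the defaults."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.use_browser = Defaults.use_browser
        # Copy the dict so edits to one request do not touch the defaults.
        self.headers = dict(Defaults.headers)


Defaults.use_browser = True
Defaults.headers["Sample"] = "Custom-Value"

first = Request("https://deventerprise.com")
first.headers["Only-Here"] = "1"  # per-request change

second = Request("https://deventerprise.com")
assert second.use_browser is True
assert second.headers == {"Sample": "Custom-Value"}
```

Whatever the SDK does internally, it is safest to set defaults once at startup, before any requests are constructed.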

Lookups

Credit balance

balance = client.get_credit_balance()
print(balance)

Supported countries

countries = client.get_supported_countries()
for country in countries:
    print(country.key, country.name, country.proxy_count)

Supported cities

cities = client.get_supported_cities("USA")
for city in cities:
    print(city.key, city.name, city.proxy_count)

Exceptions

Any client/API errors are raised as ScrapiException with HTTP status code details.

from scrapi_sdk import ScrapeRequest, ScrapiClient, ScrapiException

with ScrapiClient("YOUR_API_KEY") as client:
    try:
        response = client.scrape(ScrapeRequest("https://deventerprise.com"))
    except ScrapiException as ex:
        print(f"Error ({ex.status_code}): {ex}")
        raise
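
Because the exception carries the HTTP status code, callers can decide which failures are worth retrying. The sketch below is one possible policy, not SDK behavior: FakeApiError stands in for ScrapiException so the block runs without the SDK, and the retryable codes (429 and 5xx) and backoff values are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class FakeApiError(Exception):
    """Stand-in for ScrapiException; only status_code is modeled."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(message)
        self.status_code = status_code


def is_retryable(status_code: int) -> bool:
    # Assumed policy: rate limiting and server-side errors may be transient.
    return status_code == 429 or 500 <= status_code < 600


def call_with_retry(action: Callable[[], T], attempts: int = 3, delay: float = 0.5) -> T:
    """Run action, retrying retryable errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except FakeApiError as ex:
            if attempt == attempts or not is_retryable(ex.status_code):
                raise
            time.sleep(delay * 2 ** (attempt - 1))
    raise RuntimeError("attempts must be at least 1")


# Demo: fail twice with 503, then succeed on the third call.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeApiError(503, "temporarily unavailable")
    return "ok"


result = call_with_retry(flaky, delay=0.01)
print(result, calls["count"])
```

With the real client, the action would be something like lambda: client.scrape(request), and the except clause would catch ScrapiException instead of the stand-in.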

HTML helper utilities (optional)

Install optional dependency first:

pip install "scrapi-sdk[html]"

Helpers exported from scrapi_sdk:

  • numbers_only(text, include_decimal_points=False, trim=True)
  • html_with_no_script(html)
  • next_element(node)
  • is_visible(node, check_parent_nodes=True)

Example:

from scrapi_sdk import html_with_no_script, numbers_only

print(numbers_only("USD 1,299.95", include_decimal_points=True))
print(html_with_no_script("<p>safe</p><script>alert(1)</script>"))

Sample app

A runnable sample app is included at examples/basic_scrape/main.py.

It reads SCRAPI_API_KEY and scrapes https://deventerprise.com.
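
The sample's source is not reproduced here. As a minimal sketch of how a script might read the key and fail early when it is missing, consider the following; load_api_key is a hypothetical helper and not necessarily what the sample app does.

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from SCRAPI_API_KEY or raise a clear error."""
    key = env.get("SCRAPI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set the SCRAPI_API_KEY environment variable first.")
    return key


# Passing a plain dict keeps the demo independent of the real environment.
print(load_api_key({"SCRAPI_API_KEY": "demo-key"}))
```

Reading the key from the environment keeps it out of source control; the returned value can be passed straight to ScrapiClient.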

Development

python -m venv .venv
. .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1 (PowerShell) or . .venv/Scripts/activate (Git Bash)
pip install -e ".[dev,html]"
pytest

Build and publish

Local build

python -m pip install --upgrade pip build twine
python -m build
python -m twine check dist/*

Upload to TestPyPI

# PowerShell
$env:TWINE_USERNAME="__token__"
$env:TWINE_PASSWORD="pypi-..."
python -m twine upload -r testpypi dist/*

Upload to PyPI

# PowerShell
$env:TWINE_USERNAME="__token__"
$env:TWINE_PASSWORD="pypi-..."
python -m twine upload dist/*