Feature Request: Add serpbase.dev as a URL discovery source for research-style crawls #1953

@gefsikatsinelou

Hi @apify team,

Crawlee for Python has been my go-to library for building reliable crawlers that feed data into LLM/RAG pipelines. The structured extraction and the built-in retry logic save a ton of boilerplate compared to raw requests + BeautifulSoup.

Pain point in the research workflow
One step that consistently adds friction is the "URL discovery" phase before Crawlee even starts crawling. When I build research-style agents (e.g., "find the top 10 recent articles about X, then extract and summarize them"), I first need to discover the relevant URLs. I usually do this with a separate search script, then feed the URLs into Crawlee.

The search script is where things get brittle:

  • Running from a VPS, I hit Google rate limits after a few dozen queries.
  • Proxy rotation adds complexity that has nothing to do with the actual crawling logic.
  • Parsing search result pages breaks when markup changes.

Suggestion
Would you consider adding a built-in https://serpbase.dev integration as a "discovery" helper? It's a Google Search Results API that returns structured JSON (title, URL, snippet, rich results) via a simple HTTP call. For Crawlee, this could look like:

from crawlee import search  # proposed module, not an existing Crawlee API
urls = search.discover("serpbase", query="site:arxiv.org LLM agents", api_key="...")
crawler = MyCrawler()
await crawler.run(urls)
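
As a rough illustration of what such a helper might do internally, here is a minimal sketch using only the standard library. Everything specific in it is an assumption rather than serpbase.dev's documented API: the endpoint is supplied by the caller, and the `q`/`num` query parameters, the bearer-token header, and the `organic`/`link` response fields are hypothetical placeholders to be swapped for the real ones. The function names are illustrative too.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable

# A fetcher takes (url, headers) and returns the decoded JSON payload.
Fetcher = Callable[[str, dict[str, str]], dict]


def http_get_json(url: str, headers: dict[str, str]) -> dict:
    """Default fetcher: one blocking GET, response parsed as JSON."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_urls(payload: dict, max_results: int = 10) -> list[str]:
    """Collect result URLs in order, skipping duplicates and entries without a link.

    ASSUMPTION: results live under "organic" and each carries a "link" field.
    """
    urls: list[str] = []
    for item in payload.get("organic", []):
        if len(urls) >= max_results:
            break
        link = item.get("link")
        if link and link not in urls:
            urls.append(link)
    return urls


def discover(endpoint: str, query: str, api_key: str, max_results: int = 10,
             fetch: Fetcher = http_get_json) -> list[str]:
    """Run one search query and return the URLs to crawl.

    ASSUMPTION: the "q"/"num" parameters and bearer auth are placeholders.
    """
    params = urllib.parse.urlencode({"q": query, "num": max_results})
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = fetch(f"{endpoint}?{params}", headers)
    return extract_urls(payload, max_results)
```

Passing the fetcher in as a parameter keeps the parsing logic testable without network access or an API key, which seems useful for CI.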

Why it fits Crawlee

  • It fills a real gap in the research-to-crawl pipeline: finding the URLs to crawl in the first place.
  • The API returns clean JSON, so there's no HTML parsing fragility.
  • It keeps Crawlee useful in environments (cloud VMs, CI runners) where raw Google scraping is blocked immediately.
  • It doesn't replace the crawling logic; it just feeds better URLs into it.

Potential scope
Even a thin utility module or a recipe in the docs would help. Something like:

from crawlee.search import serpbase_discovery
urls = serpbase_discovery("climate policy site:gov.uk", max_results=20)
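
For a docs recipe, the hand-off from discovery to crawling might look roughly like the sketch below. `prepare_start_urls` and `crawl_discovered` are hypothetical names, not existing Crawlee helpers. The crawler part relies on `BeautifulSoupCrawler` from `crawlee.crawlers` as I understand the current Crawlee for Python API, and it is imported lazily so the URL clean-up can be used without Crawlee installed.

```python
from urllib.parse import urldefrag, urlsplit


def prepare_start_urls(urls: list[str], max_results: int = 20) -> list[str]:
    """Keep only http(s) URLs, strip fragments, drop duplicates, cap the count."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for raw in urls:
        if len(cleaned) >= max_results:
            break
        url = urldefrag(raw.strip()).url
        if urlsplit(url).scheme not in ("http", "https") or url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned


async def crawl_discovered(urls: list[str]) -> None:
    """Feed discovered URLs into a basic crawler that stores each page title."""
    # Imported here so prepare_start_urls works without Crawlee installed.
    from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.title.string if context.soup.title else None
        await context.push_data({"url": context.request.url, "title": title})

    await crawler.run(urls)
```

Usage would be something like `asyncio.run(crawl_discovered(prepare_start_urls(urls)))`, where `urls` comes from whatever discovery step is in place.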

Happy to help test, draft a PR, or write a docs recipe if there's interest.

    Labels

    t-tooling: Issues with this label are in the ownership of the tooling team.