Hi @apify team,
Crawlee for Python has been my go-to library for building reliable crawlers that feed data into LLM/RAG pipelines. The structured extraction and the built-in retry logic save a ton of boilerplate compared to raw requests + BeautifulSoup.
Pain point in the research workflow
One step that consistently adds friction is the "URL discovery" phase before Crawlee even starts crawling. When I build research-style agents (e.g., "find the top 10 recent articles about X, then extract and summarize them"), I first need to discover the relevant URLs. I usually do this with a separate search script, then feed the URLs into Crawlee.
The search script is where things get brittle:
- Running from a VPS, I hit Google rate limits after a few dozen queries.
- Proxy rotation adds complexity that has nothing to do with the actual crawling logic.
- Parsing search result pages breaks when markup changes.
Suggestion
Would you consider adding a built-in https://serpbase.dev integration as a "discovery" helper? It's a Google Search Results API that returns structured JSON (title, URL, snippet, rich results) via a simple HTTP call. For Crawlee, this could look like:
from crawlee import search  # proposed module, does not exist today
urls = search.discover("serpbase", query="site:arxiv.org LLM agents", api_key="...")
crawler = MyCrawler()
await crawler.run(urls)
Why it fits Crawlee
- It fills a real gap in the research-to-crawl pipeline: finding the URLs to crawl in the first place.
- The API returns clean JSON, so there's no HTML parsing fragility.
- It keeps Crawlee useful in environments (cloud VMs, CI runners) where raw Google scraping is blocked immediately.
- It doesn't replace the crawling logic — it just feeds better URLs into it.
Potential scope
Even a thin utility module or a recipe in the docs would help. Something like:
from crawlee.search import serpbase_discovery
urls = serpbase_discovery("climate policy site:gov.uk", max_results=20)
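To make the "thin utility" idea more concrete, here is a minimal sketch of what such a helper could do once a search response is in hand. Everything in it is an assumption for illustration: the names `discover_urls`, `search_fn`, and `fake_search` are hypothetical, and the response shape (`{"results": [{"title", "url", "snippet"}]}`) is a guess, not SerpBase's documented schema. The search call is injected as a plain callable so the filtering and de-duplication logic can be tested without network access.

```python
from typing import Any, Callable
from urllib.parse import urlsplit

# Hypothetical: any callable that takes a query and returns parsed JSON.
SearchFn = Callable[[str], dict[str, Any]]


def discover_urls(query: str, search_fn: SearchFn, max_results: int = 20) -> list[str]:
    """Return up to max_results unique http(s) URLs from a search response.

    Assumes the response looks like {"results": [{"url": "..."}, ...]}.
    """
    payload = search_fn(query)
    seen: set[str] = set()
    urls: list[str] = []
    for item in payload.get("results", []):
        url = item.get("url") if isinstance(item, dict) else None
        if not isinstance(url, str):
            continue
        parts = urlsplit(url)
        # Keep only absolute http(s) URLs that a crawler can actually fetch.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        urls.append(url)
        if len(urls) >= max_results:
            break
    return urls


def fake_search(query: str) -> dict[str, Any]:
    """Stand-in for a real search API call, used only for this example."""
    return {
        "results": [
            {"title": "A", "url": "https://example.com/a", "snippet": "..."},
            {"title": "A again", "url": "https://example.com/a", "snippet": "..."},
            {"title": "Bad", "url": "javascript:void(0)", "snippet": "..."},
            {"title": "B", "url": "https://example.com/b", "snippet": "..."},
        ]
    }


urls = discover_urls("climate policy site:gov.uk", fake_search)
```

The resulting list could then be passed to a crawler's run call as the start URLs. Injecting `search_fn` would also keep the helper provider-agnostic, so SerpBase would be one backend among several rather than a hard dependency.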
Happy to help test, draft a PR, or write a docs recipe if there's interest.