feat(scrapers): add anonymous Reddit scraper #1571
Description
Adds an anonymous Reddit scraper under app/proprietary/scrapers/reddit, with loid session warm-up, sticky proxy reuse, 403 IP rotation, 429 backoff, and paced .json fetching.
Motivation and Context
FIX #
Screenshots
API Changes
Change Type
Testing Performed
Checklist
High-level PR Summary
This PR introduces a standalone anonymous Reddit scraper module that uses HTTP-only proxied requests (no browser automation) to scrape posts, comments, communities, and user profiles from Reddit. The scraper works around the deprecation of Reddit's unauthenticated JSON API by warming an anonymous loid session cookie via old.reddit.com or svc/shreddit, then fetching .json endpoints through sticky residential proxies. The implementation includes proxy rotation on 403 blocks, backoff on 429 rate limits, concurrent fan-out of independent targets across a warm session pool, comprehensive offline unit tests, and a live end-to-end probe script. The module is not yet wired into routes or ingestion; it is a complete, tested standalone implementation ready for integration.
⏱️ Estimated Review Time: 1-3 hours
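For reviewers who want a mental model of the resilience behaviour summarised above, here is a minimal sketch of "sticky proxy, rotate on 403, back off on 429". It is not the code in fetch.py: `fetch_json`, `Response`, and the `transport` and `sleep` parameters are hypothetical names chosen for illustration, and loid cookie warm-up is deliberately left out. The transport and sleep functions are injected so the policy can be exercised offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Response:
    """Hypothetical minimal response: HTTP status plus raw body."""
    status: int
    body: str = ""


# Hypothetical transport signature: (url, proxy) -> Response.
Transport = Callable[[str, str], Response]


def fetch_json(
    url: str,
    proxies: List[str],
    transport: Transport,
    sleep: Callable[[float], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> Optional[str]:
    """Fetch url, reusing one proxy until it is blocked.

    Returns the body on HTTP 200, or None when attempts run out or an
    unexpected status is seen.
    """
    proxy_idx = 0          # sticky: the same proxy is reused across attempts
    delay = base_delay
    for _ in range(max_attempts):
        resp = transport(url, proxies[proxy_idx])
        if resp.status == 200:
            return resp.body
        if resp.status == 403:
            # Blocked IP: rotate to the next proxy and retry.
            proxy_idx = (proxy_idx + 1) % len(proxies)
        elif resp.status == 429:
            # Rate limited: keep the proxy, wait, then double the delay.
            sleep(delay)
            delay *= 2
        else:
            # Other statuses are not retried in this sketch.
            return None
    return None
```

In an offline test, `transport` can be a stub that replays a scripted sequence of statuses and `sleep` can be a list's `append` method, so no network access or real waiting is needed.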
💡 Review Order Suggestion
1. surfsense_backend/app/proprietary/scrapers/reddit/README.md
2. surfsense_backend/app/proprietary/scrapers/reddit/__init__.py
3. surfsense_backend/app/proprietary/scrapers/reddit/schemas.py
4. surfsense_backend/app/proprietary/scrapers/reddit/url_resolver.py
5. surfsense_backend/app/proprietary/scrapers/reddit/parsers.py
6. surfsense_backend/app/proprietary/scrapers/reddit/fetch.py
7. surfsense_backend/app/proprietary/scrapers/reddit/scraper.py
8. surfsense_backend/tests/unit/scrapers/reddit/test_skeleton.py
9. surfsense_backend/tests/unit/scrapers/reddit/test_parsers.py
10. surfsense_backend/tests/unit/scrapers/reddit/test_fetch_resilience.py
11. surfsense_backend/scripts/e2e_reddit_scraper.py
12. surfsense_backend/tests/unit/scrapers/reddit/__init__.py
13. surfsense_backend/tests/unit/scrapers/reddit/fixtures/sample_comment.json
14. surfsense_backend/tests/unit/scrapers/reddit/fixtures/sample_listing.json
15. surfsense_backend/tests/unit/scrapers/reddit/fixtures/sample_post.json