
feat(scrapers): add anonymous Reddit scraper #1571

Open
AnishSarkar22 wants to merge 16 commits into MODSetter:ci_mvp from AnishSarkar22:feat/reddit-scraping

Conversation

AnishSarkar22 (Collaborator) commented Jul 4, 2026

Description

  • Adds an anonymous, proxy-aware Reddit scraper under app/proprietary/scrapers/reddit.
  • Implements loid session warm-up, sticky proxy reuse, 403 IP rotation, 429 backoff, and paced .json fetching.
  • Adds Reddit scraper input/output schemas with an anonymous-only request contract and no authentication fields.
  • Adds URL resolution for Reddit posts, subreddits, users, and search pages.
  • Adds JSON parsers for posts, comments, communities, media fields, pagination cursors, and flattened comment trees.
  • Adds the scraper orchestrator for post, subreddit, user, and search flows with fan-out concurrency and pagination limits.
  • Adds a manual live e2e script for validating warm-up, scrape flows, and fixture generation.
  • Adds offline unit coverage for schemas, URL resolution, parser mappings, fixtures, fetch resilience, proxy rotation, backoff, and fan-out behavior.
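
To illustrate the "flattened comment trees" item above, here is a minimal sketch of one way to walk a Reddit comment listing depth-first. It is not the code in parsers.py: the function name `flatten_comments` and the output field selection are assumptions made for illustration. It relies only on the public `.json` shape, where a listing holds `data.children`, comments have kind `t1`, and `replies` is either a nested listing or an empty string.

```python
from typing import Any, Iterator


def flatten_comments(listing: dict[str, Any], depth: int = 0) -> Iterator[dict[str, Any]]:
    """Yield comments depth-first from a Reddit comment Listing.

    Hypothetical helper for illustration; the PR's parser may differ.
    """
    for child in listing.get("data", {}).get("children", []):
        # Skip "more" stubs and any other non-comment kinds.
        if child.get("kind") != "t1":
            continue
        data = child["data"]
        yield {
            "id": data.get("id"),
            "parent_id": data.get("parent_id"),
            "author": data.get("author"),
            "body": data.get("body"),
            "depth": depth,
        }
        replies = data.get("replies")
        # Reddit sends an empty string when a comment has no replies.
        if isinstance(replies, dict):
            yield from flatten_comments(replies, depth + 1)


# Small hand-written sample in the shape of a comment listing.
sample = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {
            "id": "a", "parent_id": "t3_x", "author": "u1", "body": "top",
            "replies": {"kind": "Listing", "data": {"children": [
                {"kind": "t1", "data": {
                    "id": "b", "parent_id": "t1_a", "author": "u2",
                    "body": "reply", "replies": "",
                }},
                {"kind": "more", "data": {"children": ["c"]}},
            ]}},
        }},
    ]},
}

rows = list(flatten_comments(sample))
```

Each row carries its own `depth` and `parent_id`, so the tree can be rebuilt later if needed. The sketch drops `more` stubs rather than expanding them.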

Motivation and Context

FIX #

Screenshots

API Changes

  • This PR includes API changes

Change Type

  • Bug fix
  • New feature
  • Performance improvement
  • Refactoring
  • Documentation
  • Dependency/Build system
  • Breaking change
  • Other (specify):

Testing Performed

  • Tested locally
  • Manual/QA verification

Checklist

  • Follows project coding standards and conventions
  • Documentation updated as needed
  • Dependencies updated as needed
  • No lint/build errors or new warnings
  • All relevant tests are passing

High-level PR Summary

This PR introduces a standalone anonymous Reddit scraper module that uses HTTP-only proxied requests (no browser automation) to scrape posts, comments, communities, and user profiles from Reddit. The scraper works around the deprecation of Reddit's unauthenticated JSON API by warming an anonymous loid session cookie via old.reddit.com or svc/shreddit, then fetching .json endpoints through sticky residential proxies. The implementation includes proxy rotation on 403 blocks, backoff on 429 rate limits, concurrent fan-out of independent targets across a warm session pool, comprehensive offline unit tests, and a live end-to-end probe script. The module is not yet wired into routes or ingestion: it is a complete, tested, standalone implementation ready for integration.
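
The resilience behaviour described above (sticky proxy reuse, rotation on 403, backoff on 429) can be sketched roughly as follows. This is not the code in fetch.py: `fetch_with_resilience`, `FetchFailed`, the `send` callable, and all parameter defaults are hypothetical names and values chosen for illustration. The transport is injected so the policy can be exercised offline.

```python
import itertools
import random
import time
from typing import Callable, Sequence


class FetchFailed(Exception):
    """Raised when the request cannot be completed within the attempt budget."""


def fetch_with_resilience(
    url: str,
    proxies: Sequence[str],
    send: Callable[[str, str], tuple[int, str]],  # (url, proxy) -> (status, body)
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Fetch url, keeping the same proxy until it is blocked (assumed policy)."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = itertools.cycle(proxies)
    proxy = next(pool)  # sticky: reused until a 403 forces rotation
    for attempt in range(max_attempts):
        status, body = send(url, proxy)
        if status == 200:
            return body
        if status == 403:
            # Blocked IP: switch to the next proxy and retry immediately.
            proxy = next(pool)
            continue
        if status == 429:
            # Rate limited: exponential backoff with jitter, same proxy.
            delay = base_delay * 2 ** attempt
            sleep(delay + random.uniform(0, delay / 2))
            continue
        raise FetchFailed(f"unexpected status {status} for {url}")
    raise FetchFailed(f"gave up on {url} after {max_attempts} attempts")
```

In a test, `send` can be a stub that returns canned status codes and `sleep` can be a list's `append`, so no network access or real waiting is needed.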

⏱️ Estimated Review Time: 1-3 hours

💡 Review Order Suggestion
| Order | File Path |
|-------|-----------|
| 1 | surfsense_backend/app/proprietary/scrapers/reddit/README.md |
| 2 | surfsense_backend/app/proprietary/scrapers/reddit/__init__.py |
| 3 | surfsense_backend/app/proprietary/scrapers/reddit/schemas.py |
| 4 | surfsense_backend/app/proprietary/scrapers/reddit/url_resolver.py |
| 5 | surfsense_backend/app/proprietary/scrapers/reddit/parsers.py |
| 6 | surfsense_backend/app/proprietary/scrapers/reddit/fetch.py |
| 7 | surfsense_backend/app/proprietary/scrapers/reddit/scraper.py |
| 8 | surfsense_backend/tests/unit/scrapers/reddit/test_skeleton.py |
| 9 | surfsense_backend/tests/unit/scrapers/reddit/test_parsers.py |
| 10 | surfsense_backend/tests/unit/scrapers/reddit/test_fetch_resilience.py |
| 11 | surfsense_backend/scripts/e2e_reddit_scraper.py |
| 12 | surfsense_backend/tests/unit/scrapers/reddit/__init__.py |
| 13 | surfsense_backend/tests/unit/scrapers/reddit/fixtures/sample_comment.json |
| 14 | surfsense_backend/tests/unit/scrapers/reddit/fixtures/sample_listing.json |
| 15 | surfsense_backend/tests/unit/scrapers/reddit/fixtures/sample_post.json |


vercel (Bot) commented Jul 4, 2026

@AnishSarkar22 is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai (Bot) commented Jul 4, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9f977e63-3289-477d-960f-4d921d2a10ea

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


AnishSarkar22 changed the title from "Feat/reddit scraping" to "feat(scrapers): add anonymous Reddit scraper" on Jul 4, 2026
