papers: expand crawl with CSV/EML missed-paper seeds #169

Merged
AmitMY merged 1 commit into master from papers/expand-csv-eml-seeds on Jun 20, 2026
@AmitMY AmitMY commented Jun 19, 2026

Contributor

What

Cross-referenced two external Google Scholar exports against the papers crawl to find sign-language papers we never reached, then fed the recoverable ones back in.
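
A minimal sketch of the cross-referencing step, assuming titles are compared after a simple normalization (lowercase, accents and punctuation stripped). The helper names `normalize_title` and `missed_titles` are hypothetical and not taken from the repository; the actual matching may be stricter or fuzzier.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def missed_titles(candidates: list[str], crawled: list[str]) -> list[str]:
    """Return unique candidate titles whose normalized form is not in the crawl."""
    known = {normalize_title(t) for t in crawled}
    seen: set[str] = set()
    missed = []
    for title in candidates:
        key = normalize_title(title)
        # Skip empty keys, titles already crawled, and repeated candidates.
        if key and key not in known and key not in seen:
            seen.add(key)
            missed.append(title)
    return missed
```

Exact matching on normalized strings is cheap and deterministic, but it will count near-duplicate titles (subtitle changes, truncation) as missed, so the share of missed titles it reports is an upper bound.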

  • Sources: csv-files.zip (Publish-or-Perish title exports for "sign language" + recognition/translation, pre-2000→2023) and eml-files.zip (223 "sign language" Scholar alert emails, 2025–26) — 5,585 unique candidate titles, ~44% not in the crawl.
  • New scripts/resolve_seeds.py: resolves titles → Semantic Scholar paperIds via the title-match endpoint and injects new ids into state/frontier.json (resumable).
  • Took the 1,568 missed post-2014 SL titles (≈257 non-SL false positives — semiotics, animal communication, generic gesture, audiology, spam — excluded via an LLM title judge), resolved 634 to SS ids, added 602 new seeds, and drained the crawl.
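
The resolve-and-inject flow described above could look roughly like the sketch below. It is not the contents of `scripts/resolve_seeds.py`: the frontier layout (`queue` and `resolved` keys), the function names, and the checkpoint-per-title strategy are all assumptions made for illustration. The endpoint URL is the Semantic Scholar Graph API title-match route; rate limiting and API keys are left out.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, Optional

# Semantic Scholar Graph API title-match endpoint.
MATCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search/match"


def resolve_title(title: str) -> Optional[str]:
    """Return the best-matching paperId for a title, or None if nothing matches."""
    query = urllib.parse.urlencode({"query": title, "fields": "title"})
    try:
        with urllib.request.urlopen(f"{MATCH_URL}?{query}", timeout=30) as resp:
            data = json.load(resp).get("data", [])
    except urllib.error.HTTPError as err:
        if err.code == 404:  # treated here as "no title match"
            return None
        raise
    return data[0]["paperId"] if data else None


def inject_seeds(
    titles: Iterable[str],
    frontier_path: str,
    resolver: Callable[[str], Optional[str]] = resolve_title,
) -> int:
    """Resolve titles and append unseen ids to the frontier file.

    Assumed (hypothetical) layout:
    {"queue": [paperId, ...], "resolved": {title: paperId or null}}
    """
    path = Path(frontier_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    queue = state.setdefault("queue", [])
    resolved = state.setdefault("resolved", {})
    added = 0
    for title in titles:
        if title in resolved:  # resumable: skip titles handled in an earlier run
            continue
        paper_id = resolver(title)
        resolved[title] = paper_id
        if paper_id and paper_id not in queue:
            queue.append(paper_id)
            added += 1
        path.write_text(json.dumps(state, indent=2))  # checkpoint after each title
    return added
```

Deduplicating against ids already in the frontier is one way the number of new seeds (602) can end up smaller than the number of resolved titles (634).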

Result (state.tar.gz refreshed)

| | Before | After | Δ |
|---|---|---|---|
| Total rendered papers | 15,575 | 16,137 | +562 |
| Papers ≥ 2014 | 11,282 | 11,823 | +541 |

Notes

  • The expansion confirmed the existing crawl had already captured essentially the entire reachable SL citation closure: of 555 newly-expanded nodes, only 24 surfaced any new SL neighbor (37 total), and the BFS hit a fixed point in 9 iterations. The added papers were peripheral leaves nothing previously pointed to.
  • The remaining ~933 unresolved titles are not indexed on Semantic Scholar (recent theses, local/regional journals) — unreachable by an SS-based crawl, left out by design.
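
To illustrate what "the BFS hit a fixed point" means here, the sketch below expands a frontier level by level until an iteration adds no new relevant node. It is an assumption-laden illustration rather than the crawler's code: `neighbors` and `is_relevant` are hypothetical callables standing in for the citation lookup and the SL relevance filter.

```python
from typing import Callable, Hashable, Iterable, Set, Tuple


def expand_to_fixed_point(
    seeds: Iterable[Hashable],
    known: Iterable[Hashable],
    neighbors: Callable[[Hashable], Iterable[Hashable]],
    is_relevant: Callable[[Hashable], bool],
) -> Tuple[Set[Hashable], int]:
    """BFS from the new seeds until an iteration discovers nothing new.

    Returns the final closure and the number of iterations run, counting
    the last one that found no new nodes.
    """
    closure = set(known)
    frontier = {s for s in seeds if s not in closure}
    closure |= frontier
    iterations = 0
    while frontier:
        iterations += 1
        next_frontier = set()
        for node in frontier:
            for nb in neighbors(node):
                # Only unseen, relevant neighbors extend the crawl.
                if nb not in closure and is_relevant(nb):
                    next_frontier.add(nb)
        closure |= next_frontier
        frontier = next_frontier
    return closure, iterations
```

When most neighbors of the new seeds are already in `known`, each level shrinks quickly and the loop ends after a handful of iterations, which matches the pattern reported above.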

🤖 Generated with Claude Code

Cross-referenced two external Google Scholar exports (Publish-or-Perish
title CSVs + "sign language" alert EMLs) against the crawl to find SL
papers we never reached.

- New `scripts/resolve_seeds.py`: resolves paper titles to Semantic
  Scholar paperIds via the title-match endpoint and injects the new
  ids into state/frontier.json (resumable, no year filter).
- Took the 1,568 missed post-2014 SL titles (non-SL false positives
  excluded via an LLM title judge), resolved 634 to SS ids, added 602
  new seeds, and drained the crawl.

Result (state.tar.gz refreshed): 15,575 -> 16,137 rendered papers
(+562; +541 in the >=2014 window). The remaining ~933 unresolved
titles are not indexed on Semantic Scholar (recent theses, local
journals), so they're unreachable by the SS crawl.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AmitMY AmitMY merged commit dfc5ed5 into master Jun 20, 2026
1 check passed