papers: expand crawl with CSV/EML missed-paper seeds by AmitMY · Pull Request #169 · sign-language-processing/sign-language-processing.github.io

AmitMY · 2026-06-19T11:17:01Z

What

Cross-referenced two external Google Scholar exports against the papers crawl to find sign-language papers we never reached, then fed the recoverable ones back in.
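The cross-referencing step can be sketched as a set difference over normalized titles. This is only an illustration: the PR does not say how titles were compared, and `normalize_title` and `missed_titles` are hypothetical names, not functions from the repository.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def missed_titles(candidates, crawled):
    """Return candidate titles whose normalized form is absent from the crawl.

    Assumption: exact match on the normalized string; the real comparison
    may be fuzzier.
    """
    seen = {normalize_title(t) for t in crawled}
    missed = []
    for title in candidates:
        key = normalize_title(title)
        if key and key not in seen:
            seen.add(key)  # also de-duplicates the candidates themselves
            missed.append(title)
    return missed
```

Normalizing before comparing keeps differences in case, accents, and punctuation between the Scholar exports and the crawl from being counted as missed papers.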

- Sources: csv-files.zip (Publish-or-Perish title exports for "sign language" + recognition/translation, pre-2000→2023) and eml-files.zip (223 "sign language" Scholar alert emails, 2025–26). Together they yield 5,585 unique candidate titles, ~44% of which are not in the crawl.
- New `scripts/resolve_seeds.py`: resolves titles → Semantic Scholar paperIds via the title-match endpoint and injects new ids into state/frontier.json (resumable).
- Took the 1,568 missed post-2014 SL titles (≈257 non-SL false positives, such as semiotics, animal communication, generic gesture, audiology, and spam, were excluded via an LLM title judge), resolved 634 to SS ids, added 602 new seeds, and drained the crawl.
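A rough sketch of what a resumable resolver along these lines might look like. It is not the contents of `scripts/resolve_seeds.py`: the cache file, the flat-list layout assumed for the frontier file, and all function names are hypothetical. The URL is the Semantic Scholar Graph API title-match endpoint as I understand it, and treating a 404 as "no match" is likewise an assumption.

```python
import json
import os
import urllib.error
import urllib.parse
import urllib.request

# Assumed endpoint: Semantic Scholar Graph API title match.
MATCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search/match"


def _load(path, default):
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    return default


def _save(path, value):
    tmp = f"{path}.tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(value, fh)
    os.replace(tmp, path)  # atomic rename: an interrupted run leaves no half-written file


def fetch_match(title):
    """Return the best-matching paperId for a title, or None if nothing matches."""
    query = urllib.parse.urlencode({"query": title, "fields": "title"})
    try:
        with urllib.request.urlopen(f"{MATCH_URL}?{query}", timeout=30) as resp:
            data = json.load(resp).get("data", [])
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumption: 404 means no title match
            return None
        raise
    return data[0]["paperId"] if data else None


def resolve_seeds(titles, frontier_path, cache_path, known_ids=(), fetch=fetch_match):
    """Resolve titles to ids, append unseen ids to the frontier, return the new ids.

    Assumption: the frontier file is a flat JSON list of paperIds.
    """
    cache = _load(cache_path, {})  # title -> paperId or None; makes reruns resumable
    frontier = _load(frontier_path, [])
    seen = set(frontier) | set(known_ids)
    added = []
    for title in titles:
        if title not in cache:
            cache[title] = fetch(title)
            _save(cache_path, cache)  # checkpoint after every lookup
        paper_id = cache[title]
        if paper_id and paper_id not in seen:
            seen.add(paper_id)
            frontier.append(paper_id)
            added.append(paper_id)
    _save(frontier_path, frontier)
    return added
```

Passing `fetch` as a parameter keeps the network call swappable, so the resume and de-duplication logic can be exercised offline with a stub.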

Result (`state.tar.gz` refreshed)

| Metric | Before | After | Δ |
| --- | --- | --- | --- |
| Total rendered papers | 15,575 | 16,137 | +562 |
| Papers ≥ 2014 | 11,282 | 11,823 | +541 |

Notes

- The expansion confirmed the existing crawl had already captured essentially the entire reachable SL citation closure: of 555 newly-expanded nodes, only 24 surfaced any new SL neighbor (37 total), and the BFS hit a fixed point in 9 iterations. The added papers were peripheral leaves nothing previously pointed to.
- The remaining ~933 unresolved titles are not indexed on Semantic Scholar (recent theses, local/regional journals), so they are unreachable by an SS-based crawl and are left out by design.
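The "BFS hit a fixed point" observation can be illustrated with a small sketch of a relevance-filtered breadth-first expansion. The crawler's actual code is not shown in this PR, so `expand_to_fixed_point`, `neighbors`, and `is_relevant` are hypothetical stand-ins for the citation lookup and the SL filter.

```python
def expand_to_fixed_point(seeds, known, neighbors, is_relevant):
    """Expand the frontier round by round until a round adds no new node.

    Returns (all known nodes, number of rounds run).
    """
    known = set(known)
    # Drop duplicate seeds and seeds the crawl already contains.
    frontier = [s for s in dict.fromkeys(seeds) if s not in known]
    known.update(frontier)
    rounds = 0
    while frontier:
        rounds += 1
        next_frontier = []
        for node in frontier:
            for nb in neighbors(node):
                if nb not in known and is_relevant(nb):
                    known.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier  # an empty frontier means the fixed point is reached
    return known, rounds
```

When the existing set is already close to the full closure, most seeds contribute no unseen relevant neighbor and the loop ends after a few rounds, which matches the pattern described in the first note.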

🤖 Generated with Claude Code

Cross-referenced two external Google Scholar exports (Publish-or-Perish title CSVs + "sign language" alert EMLs) against the crawl to find SL papers we never reached.

- New `scripts/resolve_seeds.py`: resolves paper titles to Semantic Scholar paperIds via the title-match endpoint and injects the new ids into state/frontier.json (resumable, no year filter).
- Took the 1,568 missed post-2014 SL titles (non-SL false positives excluded via an LLM title judge), resolved 634 to SS ids, added 602 new seeds, and drained the crawl.

Result (state.tar.gz refreshed): 15,575 -> 16,137 rendered papers (+562; +541 in the >=2014 window).

The remaining ~933 unresolved titles are not indexed on Semantic Scholar (recent theses, local journals), so they're unreachable by the SS crawl.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AmitMY merged commit dfc5ed5 into master Jun 20, 2026
1 check passed

AmitMY merged 1 commit into master from papers/expand-csv-eml-seeds
