papers: expand crawl with CSV/EML missed-paper seeds #169
Merged
Cross-referenced two external Google Scholar exports (Publish-or-Perish title CSVs + "sign language" alert EMLs) against the crawl to find SL papers we never reached.

- New `scripts/resolve_seeds.py`: resolves paper titles to Semantic Scholar paperIds via the title-match endpoint and injects the new ids into state/frontier.json (resumable, no year filter).
- Took the 1,568 missed post-2014 SL titles (non-SL false positives excluded via an LLM title judge), resolved 634 to SS ids, added 602 new seeds, and drained the crawl.

Result (state.tar.gz refreshed): 15,575 -> 16,137 rendered papers (+562; +541 in the >=2014 window).

The remaining ~933 unresolved titles are not indexed on Semantic Scholar (recent theses, local journals), so they're unreachable by the SS crawl.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Cross-referenced two external Google Scholar exports against the papers crawl to find sign-language papers we never reached, then fed the recoverable ones back in.
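The cross-reference step can be sketched roughly as a set difference over normalized titles. This is a minimal illustration, not the code used in this PR: `normalize_title` and `missed_titles` are hypothetical helpers, and the normalization rules (accent stripping, punctuation removal, case folding) are assumptions about what counts as "the same title".

```python
import re
import unicodedata


def normalize_title(title):
    """Fold case, strip accents and punctuation, collapse whitespace (assumed rules)."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def missed_titles(candidates, crawled):
    """Return candidate titles whose normalized form is absent from the crawl.

    Duplicates among the candidates (after normalization) are reported once,
    keeping the first spelling seen.
    """
    seen = {normalize_title(t) for t in crawled}
    missed = {}
    for title in candidates:
        key = normalize_title(title)
        if key and key not in seen and key not in missed:
            missed[key] = title
    return list(missed.values())
```

Exact matching on normalized strings is deliberately conservative: it will report a paper as missed if the two sources spell its title differently, which is one reason a later filtering pass over the missed list is useful.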
- `csv-files.zip` (Publish-or-Perish title exports for "sign language" + recognition/translation, pre-2000 → 2023) and `eml-files.zip` (223 "sign language" Scholar alert emails, 2025–26): 5,585 unique candidate titles, ~44% not in the crawl.
- `scripts/resolve_seeds.py`: resolves titles → Semantic Scholar paperIds via the title-match endpoint and injects new ids into `state/frontier.json` (resumable).

Result (`state.tar.gz` refreshed)
Notes
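For readers unfamiliar with the resolve-and-inject flow, here is a rough sketch of how a resumable title resolver could work. It is not the contents of `scripts/resolve_seeds.py`. Assumptions: the frontier file is a flat JSON list of paperIds, progress is kept in a separate JSON cache mapping title to paperId (or null), and `match_title`, `inject_seeds`, `load_json`, and `save_json` are hypothetical names. The lookup uses the Semantic Scholar Graph API title-match endpoint (`/graph/v1/paper/search/match`), which responds with 404 when no match is found.

```python
import json
import os
import urllib.error
import urllib.parse
import urllib.request

MATCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search/match"


def match_title(title):
    """Return the best-matching paperId for a title, or None if nothing matches."""
    query = urllib.parse.urlencode({"query": title, "fields": "title"})
    try:
        with urllib.request.urlopen(MATCH_URL + "?" + query, timeout=30) as resp:
            data = json.load(resp).get("data", [])
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no title match
            return None
        raise
    return data[0]["paperId"] if data else None


def load_json(path, default):
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    return default


def save_json(path, obj):
    # Write to a temp file first so an interrupted run never leaves a half-written file.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(obj, fh)
    os.replace(tmp, path)


def inject_seeds(titles, frontier_path, cache_path, lookup=match_title):
    """Resolve titles to paperIds and append unseen ids to the frontier.

    Assumed layout: frontier is a JSON list of ids; cache maps title -> id or null.
    Titles already in the cache are not looked up again, which makes reruns cheap.
    """
    frontier = load_json(frontier_path, [])
    cache = load_json(cache_path, {})
    known = set(frontier)
    added = []
    for title in titles:
        if title not in cache:
            cache[title] = lookup(title)
            save_json(cache_path, cache)  # persist progress after every lookup
        paper_id = cache[title]
        if paper_id and paper_id not in known:
            known.add(paper_id)
            frontier.append(paper_id)
            added.append(paper_id)
    save_json(frontier_path, frontier)
    return added
```

Passing `lookup` as a parameter keeps the network call swappable, so the injection logic can be exercised offline; a real run would also want rate limiting and an API key, which this sketch omits.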
🤖 Generated with Claude Code