Link Checker: rolling issue + Crossref DOI validation #757
Merged
richarddushime merged 4 commits into main on May 7, 2026
The weekly Link Checker run now finds the most recent open "link-check"-labeled issue and edits its body in place, posting a short comment so subscribers see the refresh. Falls back to creating a fresh issue only when none is open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
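A minimal Python sketch of the rolling-issue decision described above. The `Issue` record, the function names, and the use of creation time to define "most recent" are assumptions for illustration; the workflow's actual implementation is not shown in this PR text.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence, Tuple


@dataclass
class Issue:
    """Hypothetical, simplified view of a GitHub issue."""
    number: int
    state: str               # "open" or "closed"
    labels: Sequence[str]
    created_at: datetime


def pick_rolling_issue(issues: Sequence[Issue],
                       label: str = "link-check") -> Optional[Issue]:
    """Return the most recently created open issue carrying `label`, if any."""
    candidates = [i for i in issues if i.state == "open" and label in i.labels]
    return max(candidates, key=lambda i: i.created_at, default=None)


def plan_action(issues: Sequence[Issue],
                label: str = "link-check") -> Tuple[str, Optional[int]]:
    """Decide whether to update the rolling issue or create a fresh one."""
    target = pick_rolling_issue(issues, label)
    if target is None:
        return ("create", None)       # no open issue: fall back to a new one
    return ("update", target.number)  # edit the body in place, then comment
```

Keeping the decision in a pure function makes it possible to test the fallback path without calling the GitHub API.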
👍 All image files/references (if any) are in webp format, in line with our policy.
✅ Spell Check Passed: no spelling issues found when checking 2 changed file(s)! 🎉
This PR was attempted for staging deployment but had merge conflicts and was skipped. Attempted at: 2026-05-07 19:13:54 UTC. Please resolve conflicts with the base branch and the deployment will be retried automatically.
Lychee was following doi.org redirects to publisher sites and getting bot-blocked there, producing 403 noise and missed real-typo DOIs. The workflow now extracts every doi.org URL from the rendered site and checks it against the Crossref REST API (which doesn't bot-block); doi.org / dx.doi.org are excluded from lychee so the redirect path isn't double-checked.

Implementation notes:
- DOIs in the HTML are sometimes URL-encoded (e.g. %2F for /); decode before re-encoding for the Crossref URL to avoid double-encoding.
- Crossref rate-limits HEAD bursts even within the polite pool, so concurrency is capped at 4 and 429 responses are retried with exponential backoff.
- A local sample of ~2700 DOIs runs in roughly 3 minutes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
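The two implementation notes on encoding and rate limiting can be sketched in Python as follows. This is an illustration under assumptions, not the workflow's code: the function names, the default retry count and base delay, and the choice to percent-encode `/` in the Crossref path are all hypothetical. The request itself is passed in as a callable so the retry logic stays independent of any HTTP library.

```python
import time
from typing import Callable
from urllib.parse import quote, unquote

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(raw_doi: str) -> str:
    """Build a Crossref lookup URL from a DOI that may already be URL-encoded.

    Decoding first means an existing %2F is not re-encoded to %252F.
    """
    doi = unquote(raw_doi)
    return CROSSREF_WORKS + quote(doi, safe="")  # assumption: encode "/" too


def request_with_backoff(send: Callable[[], int],
                         max_retries: int = 4,
                         base_delay: float = 1.0,
                         sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `send` and retry while it returns 429, doubling the wait each time."""
    status = send()
    for attempt in range(max_retries):
        if status != 429:
            break
        sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
        status = send()
    return status
```

The concurrency cap of 4 mentioned above could be applied separately, for example by running these calls through `concurrent.futures.ThreadPoolExecutor(max_workers=4)`.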
Crossref's API only knows Crossref-registered DOIs and returns 404 for
DOIs minted by other agencies (DataCite for Zenodo / OSF / institutional
repositories, JaLC, mEDRA, etc.), which produced 43/58 false positives
on the local site. The DOI Handle API at doi.org/api/handles/{doi} is
the authoritative cross-registrar resolver — responseCode 1 means the
handle exists, 100 means it does not.
Also fixes a regex bug that truncated SICI-style DOIs at the first ')':
the extractor now allows parens in the DOI body and strips trailing
unbalanced ')' / ']' afterwards, so DOIs like
10.1016/0277-9536(95)00127-S are captured intact.
On the local site this reduced the broken count from 58 -> 11, and the
validation step now runs in ~25s instead of ~160s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
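
A hedged Python sketch of the Handle API check described in this commit message. Only the `doi.org/api/handles/{doi}` endpoint and the meanings of `responseCode` `1` and `100` come from the text above; the function names, the timeout, the handling of other codes, and the assumption that a missing handle arrives as an HTTP error with a JSON body are illustrative.

```python
import json
import urllib.error
import urllib.request
from typing import Optional
from urllib.parse import quote

HANDLE_API = "https://doi.org/api/handles/"


def doi_exists(payload: dict) -> Optional[bool]:
    """Interpret a Handle API JSON body: 1 = exists, 100 = not found."""
    code = payload.get("responseCode")
    if code == 1:
        return True
    if code == 100:
        return False
    return None  # any other code: treat as "unknown" rather than broken


def check_doi(doi: str, timeout: float = 10.0) -> Optional[bool]:
    """Query the Handle API for one DOI (performs a network request)."""
    url = HANDLE_API + quote(doi, safe="/")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except urllib.error.HTTPError as err:
        body = err.read()  # assumption: error responses still carry JSON
    try:
        return doi_exists(json.loads(body))
    except ValueError:
        return None  # body was not valid JSON
```

Returning `None` for anything other than the two documented codes keeps transient failures from being reported as broken DOIs.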
Summary
Three improvements to the weekly Link Checker workflow:
1. Rolling issue instead of weekly duplicates. The workflow now finds the most recent open issue labelled `link-check` and edits its body in place, leaving a short comment so subscribers see the refresh. A new issue is created only if none is open. (Six older duplicates have already been closed manually; Link Checker Report #751 is now the rolling issue.)
2. Registry-agnostic DOI validation. `doi.org` / `dx.doi.org` are excluded from lychee (which was following the redirect to publishers and getting bot-blocked), and a new step extracts every DOI from the rendered HTML and checks each against the DOI Handle API at `doi.org/api/handles/{doi}`. The Handle API resolves Crossref, DataCite (Zenodo / OSF / institutional repositories), JaLC, mEDRA, etc.; `responseCode` `1` means the handle exists, `100` means it doesn't. The original Crossref-only check produced 43/58 false positives on the local site.
3. Balanced-paren DOI extraction. SICI-style DOIs like `10.1016/0277-9536(95)00127-S` legitimately contain parens. The extractor now allows them in the DOI body and strips only unbalanced trailing `)` / `]` afterwards.
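
The balanced-paren extraction in the last item could look roughly like the following Python sketch. The regular expression, the function names, and the de-duplication step are assumptions for illustration; the PR text only specifies that parens are allowed in the DOI body and that unbalanced trailing `)` / `]` are stripped afterwards.

```python
import re
from typing import List

# Assumed pattern: "10.", a registrant code, "/", then anything up to
# whitespace, a quote, or an angle bracket (parens and brackets allowed).
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"'<>]+")

CLOSERS = {")": "(", "]": "["}


def strip_unbalanced(doi: str) -> str:
    """Drop trailing ')' or ']' that have no matching opener in the DOI."""
    while doi and doi[-1] in CLOSERS:
        closer = doi[-1]
        if doi.count(closer) > doi.count(CLOSERS[closer]):
            doi = doi[:-1]
        else:
            break
    return doi


def extract_dois(text: str) -> List[str]:
    """Return unique DOIs found in `text`, in order of first appearance."""
    found = (strip_unbalanced(m.group(0)) for m in DOI_RE.finditer(text))
    return list(dict.fromkeys(found))
```

A DOI that ends in a balanced closing paren is left alone, because the count of closers does not exceed the count of openers.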
Other implementation notes:
On the local site, validating ~2000 unique DOIs takes ~25s and reports 11 genuine 404s. The Crossref-only check reported 58, with false positives and true positives mixed together.
Test plan
🤖 Generated with Claude Code