Link Checker: rolling issue + Crossref DOI validation #757

Merged
richarddushime merged 4 commits into main from link-check-rolling-issue
May 7, 2026

Contributor

@LukasWallrich LukasWallrich commented May 1, 2026

Summary

Three improvements to the weekly Link Checker workflow:

  1. Rolling issue instead of weekly duplicates. The workflow now finds the most recent open issue labelled `link-check` and edits its body in place, leaving a short comment so subscribers see the refresh. A new issue is created only if none is open. (Six older duplicates have already been closed manually; Link Checker Report #751 is now the rolling issue.)

  2. Registry-agnostic DOI validation. `doi.org` / `dx.doi.org` are excluded from lychee (which was following the redirect to publishers and getting bot-blocked), and a new step extracts every DOI from the rendered HTML and checks each against the DOI Handle API at `doi.org/api/handles/{doi}`. The Handle API resolves Crossref, DataCite (Zenodo / OSF / institutional repositories), JaLC, mEDRA, etc. A `responseCode` of `1` means the handle exists; `100` means it doesn't. Of the 58 DOIs flagged by the original Crossref-only check on the local site, 43 were false positives.

  3. Balanced-paren DOI extraction. SICI-style DOIs like `10.1016/0277-9536(95)00127-S` legitimately contain parens. The extractor now allows them in the DOI body and strips only unbalanced trailing `)` / `]` afterwards.

    Other implementation notes:

    • DOIs in HTML are sometimes URL-encoded (e.g. `%2F` for `/`), so they are decoded before being re-encoded for the API call to avoid double-encoding.
    • 429 and 5xx responses are retried with exponential backoff; concurrency is capped at 6.
    • The polite User-Agent identifies the client with `info@forrt.org`.

    On the local site, validating ~2000 unique DOIs takes ~25s and reports 11 genuine 404s (vs. 58 reports from the Crossref-only check, which mixed false positives with true positives).
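
For reviewers, the decode-before-re-encode note can be illustrated with a minimal sketch. This is not the workflow's actual code: `handle_api_url` is a hypothetical helper name, and keeping `/` unescaped in the path is an assumption about how the request URL is built.

```python
from urllib.parse import quote, unquote

HANDLE_API = "https://doi.org/api/handles/"


def handle_api_url(raw_doi: str) -> str:
    """Build a Handle API URL from a DOI that may already be percent-encoded.

    Hypothetical helper for illustration only.
    """
    decoded = unquote(raw_doi)  # "10.1000%2Fabc" -> "10.1000/abc"
    # Encode exactly once; "/" stays literal (an assumption of this sketch).
    return HANDLE_API + quote(decoded, safe="/")


def naive_url(raw_doi: str) -> str:
    """What happens without the decode step: "%" itself gets escaped."""
    return HANDLE_API + quote(raw_doi, safe="/")
```

With this sketch, `handle_api_url("10.1000%2Fabc")` and `handle_api_url("10.1000/abc")` produce the same URL, while `naive_url("10.1000%2Fabc")` contains `%252F`, which is the double-encoding the note warns about.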

Test plan

  • Trigger via `workflow_dispatch` and confirm Link Checker Report #751 is updated rather than a new issue created.
  • Confirm the issue body's "Broken DOIs" section now flags only genuine typos (no Zenodo / OSF / Cambridge / etc. false positives).
  • Spot-check a couple of the reported broken DOIs to verify they really are unregistered.

🤖 Generated with Claude Code

The weekly Link Checker run now finds the most recent open
"link-check"-labeled issue and edits its body in place, posting a
short comment so subscribers see the refresh. Falls back to creating
a fresh issue only when none is open.
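
The decision described above can be sketched as a pure function. This is an illustration, not the workflow's code: the issue dictionaries, the action names, and the use of `created_at` to define "most recent" are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

LABEL = "link-check"


def pick_rolling_issue(open_issues: List[Dict]) -> Optional[Dict]:
    """Return the most recently created open issue carrying LABEL, if any."""
    labelled = [i for i in open_issues if LABEL in i.get("labels", [])]
    if not labelled:
        return None
    # ISO 8601 UTC timestamps in one format sort correctly as strings.
    return max(labelled, key=lambda i: i["created_at"])


def plan_update(open_issues: List[Dict], report: str) -> List[Tuple[str, Dict]]:
    """List the (hypothetical) actions a run would take for this report."""
    issue = pick_rolling_issue(open_issues)
    if issue is None:
        # Fallback: no open rolling issue, so create a fresh one.
        return [("create_issue",
                 {"title": "Link Checker Report", "labels": [LABEL], "body": report})]
    return [
        ("edit_issue_body", {"number": issue["number"], "body": report}),
        # The short comment is what notifies subscribers of the refresh.
        ("add_comment", {"number": issue["number"], "body": "Report refreshed."}),
    ]
```

Keeping the decision separate from the API calls makes the "edit, then comment, else create" behaviour easy to check without touching GitHub.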

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukasWallrich LukasWallrich requested a review from a team as a code owner May 1, 2026 10:33
Contributor

github-actions Bot commented May 1, 2026

👍 All image files/references (if any) are in webp format, in line with our policy.

Contributor

github-actions Bot commented May 1, 2026

✅ Spell Check Passed

No spelling issues found when checking 2 changed file(s)! 🎉

Contributor Author

LukasWallrich commented May 1, 2026

⚠️ Staging Deployment Status

This PR was attempted for staging deployment but had merge conflicts and was skipped.

Attempted at: 2026-05-07 19:13:54 UTC
Staging URL: https://staging.forrt.org

Please resolve conflicts with the base branch and the deployment will be retried automatically.

Lychee was following doi.org redirects to publisher sites and getting
bot-blocked there, producing 403 noise and missed real-typo DOIs. The
workflow now extracts every doi.org URL from the rendered site and
checks it against the Crossref REST API (which doesn't bot-block);
doi.org / dx.doi.org are excluded from lychee so the redirect path
isn't double-checked.

Implementation notes:
- DOIs in the HTML are sometimes URL-encoded (e.g. %2F for /) — decode
  before re-encoding for the Crossref URL to avoid double-encoding.
- Crossref rate-limits HEAD bursts even within the polite pool, so
  concurrency is capped at 4 and 429 responses are retried with
  exponential backoff.
- A local sample of ~2700 DOIs runs in roughly 3 minutes.
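
The concurrency cap and retry behaviour in the notes above could look roughly like the following sketch. It is not the workflow's implementation: `fetch` is a hypothetical callable returning an HTTP status code, the delay schedule (1s, 2s, 4s, ...) and `max_attempts` are assumed values, and 5xx is treated as retryable as in the PR summary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def is_retryable(status: int) -> bool:
    # 429 is the case named above; 5xx follows the PR summary.
    return status == 429 or 500 <= status < 600


def fetch_with_backoff(
    fetch: Callable[[str], int],
    doi: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch(doi), retrying retryable statuses with doubling delays."""
    status = fetch(doi)
    for attempt in range(1, max_attempts):
        if not is_retryable(status):
            break
        sleep(base_delay * 2 ** (attempt - 1))
        status = fetch(doi)
    return status


def check_all(
    dois: Iterable[str],
    fetch: Callable[[str], int],
    max_workers: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Check every DOI with at most max_workers requests in flight."""
    doi_list = list(dois)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(
            pool.map(lambda d: fetch_with_backoff(fetch, d, sleep=sleep), doi_list)
        )
    return dict(zip(doi_list, statuses))
```

Passing `sleep` as a parameter is a design choice of this sketch: it lets the backoff schedule be tested without real delays.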

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukasWallrich LukasWallrich changed the title from "Update existing link-check issue instead of opening weekly duplicates" to "Link Checker: rolling issue + Crossref DOI validation" May 1, 2026
@LukasWallrich LukasWallrich mentioned this pull request May 1, 2026
71 tasks
Crossref's API only knows Crossref-registered DOIs and returns 404 for
DOIs minted by other agencies (DataCite for Zenodo / OSF / institutional
repositories, JaLC, mEDRA, etc.), which produced 43/58 false positives
on the local site. The DOI Handle API at doi.org/api/handles/{doi} is
the authoritative cross-registrar resolver — responseCode 1 means the
handle exists, 100 means it does not.
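
As a minimal sketch of how the two response codes named above might be interpreted (the HTTP request itself is omitted, `classify_handle_response` is a hypothetical name, and any code other than 1 or 100 is deliberately left as "unknown" rather than guessed at):

```python
import json


def classify_handle_response(body: str) -> str:
    """Map a Handle API JSON body to a verdict for one DOI."""
    code = json.loads(body).get("responseCode")
    if code == 1:
        return "registered"    # the handle exists
    if code == 100:
        return "unregistered"  # the handle does not exist
    return "unknown"           # anything else is not classified by this sketch
```

Treating unrecognised codes as "unknown" keeps transient or unexpected responses from being reported as broken DOIs.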

Also fixes a regex bug that truncated SICI-style DOIs at the first ')':
the extractor now allows parens in the DOI body and strips trailing
unbalanced ')' / ']' afterwards, so DOIs like
10.1016/0277-9536(95)00127-S are captured intact.
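
One way to express this "allow parens, then trim unbalanced closers" approach is sketched below. The regular expression and function names are assumptions for illustration and are not taken from the PR's extractor.

```python
import re
from typing import List

# Assumed pattern: "10.", a 4-9 digit prefix, "/", then a body that may
# contain parentheses and brackets but stops at whitespace, quotes, or "<>".
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"'<>]+")
OPENER = {")": "(", "]": "["}


def strip_unbalanced_trailing(doi: str) -> str:
    """Drop trailing ')' or ']' only while it has no matching opener."""
    while doi and doi[-1] in OPENER:
        closer = doi[-1]
        if doi.count(closer) > doi.count(OPENER[closer]):
            doi = doi[:-1]
        else:
            break
    return doi


def extract_dois(text: str) -> List[str]:
    """Find DOI-like strings in text and trim unbalanced trailing closers."""
    return [strip_unbalanced_trailing(m.group(0)) for m in DOI_RE.finditer(text)]
```

In this sketch a DOI wrapped in prose parentheses, such as `(see https://doi.org/10.1016/0277-9536(95)00127-S)`, loses only the final `)`, while a DOI that legitimately ends in a balanced `)` is left unchanged.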

On the local site this reduced the broken count from 58 -> 11, and the
validation step now runs in ~25s instead of ~160s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@richarddushime richarddushime left a comment


LGTM 👍

@richarddushime richarddushime merged commit 4b79323 into main May 7, 2026
5 checks passed
@richarddushime richarddushime deleted the link-check-rolling-issue branch May 7, 2026 19:37

3 participants