Zenodo preservation-mirror foundation: client + manifest extension #811
Open
Groundwork for the Zenodo upload workstream (issue #810): durable mirror of each certified microdata release to a preservation-grade host (CERN/OpenAIRE-operated, DOI-minting) so TRO citation URLs stay verifiable decades from now even if HuggingFace changes its hosting.

- New policyengine_us_data/utils/zenodo_client.py: typed wrapper around the Zenodo REST API. One public function, create_and_publish_deposit(), handles the four-step Zenodo flow (create deposit, upload files, set metadata, publish) and returns the version + concept DOIs plus per-file download URLs and checksums. Env-var gated: ZENODO_ACCESS_TOKEN must be set or the function raises ZenodoNotConfigured, which callers should treat as 'preservation mirroring disabled for this release' rather than a failure.
- Extends build_release_manifest() with two new kwargs: preservation_mirrors_by_artifact (per-artifact Zenodo or other mirror metadata) and preservation_dois (release-level Zenodo DOIs). Populates the fields introduced in PolicyEngine/policyengine.py#317 on the emitted manifest JSON.
- 11 zenodo-client tests (happy path, missing token, missing file, API error wrapping, metadata payload serialization, env-var handling). 3 release-manifest tests (no fields when not provided, per-artifact mirror preserved, empty list treated as absent).
- Full unit suite green (853 passed, 3 pre-existing skips).

Modal-build wiring is deferred to a follow-up PR that requires a real Zenodo access token and a sandbox test round-trip. This commit is the contract + client + tests, with no production behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
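For reviewers, here is a minimal sketch of how a caller might treat ZenodoNotConfigured as "mirroring disabled" rather than a failure. It is not the code in zenodo_client.py. The exception name and the ZENODO_ACCESS_TOKEN variable come from this PR, while require_zenodo_token() and mirror_release_if_configured() are hypothetical helpers, and the publish callable is only a stand-in for create_and_publish_deposit(), whose real signature is not shown here.

```python
import os
from typing import Any, Callable, Mapping, Optional


class ZenodoNotConfigured(RuntimeError):
    """Stand-in for the exception named in this PR."""


def require_zenodo_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: return the token or raise ZenodoNotConfigured."""
    env = os.environ if env is None else env
    token = env.get("ZENODO_ACCESS_TOKEN", "").strip()
    if not token:
        raise ZenodoNotConfigured("ZENODO_ACCESS_TOKEN is not set")
    return token


def mirror_release_if_configured(
    publish: Callable[[], dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Hypothetical caller-side wrapper.

    Runs `publish` (a stand-in for create_and_publish_deposit) and returns
    its result, or None when mirroring is disabled for this release.
    Any other exception propagates, so real API errors still surface.
    """
    try:
        return publish()
    except ZenodoNotConfigured:
        return None
```

Returning None keeps the release build on its existing path when no token is present, which is consistent with the "no production behavior change" goal of this commit.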
Summary
Scope
This PR is the foundation (contract + client + tests) with no production behavior change. Modal-build wiring is a follow-up PR that requires a production Zenodo access token and a sandbox round-trip to verify. Splitting the work across two PRs keeps this one reviewable without needing secrets.
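As a rough illustration of the manifest half of that contract, the sketch below mimics the three behaviors the release-manifest tests describe: no fields when nothing is provided, a per-artifact mirror preserved, and an empty list treated as absent. It is not build_release_manifest() itself. The helper name add_preservation_fields(), the "artifacts" layout, and the output keys "preservation_mirrors" and "preservation_dois" are assumptions for illustration; the actual field names are defined in PolicyEngine/policyengine.py#317.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any, Optional


def add_preservation_fields(
    manifest: dict[str, Any],
    preservation_mirrors_by_artifact: Optional[dict[str, list[dict[str, Any]]]] = None,
    preservation_dois: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: return a copy of `manifest` with optional
    preservation metadata attached. The manifest shape is assumed."""
    out = deepcopy(manifest)
    artifacts = out.get("artifacts", {})

    # Attach mirrors only to known artifacts, and only when non-empty,
    # so an empty list behaves the same as an omitted kwarg.
    for name, mirrors in (preservation_mirrors_by_artifact or {}).items():
        if mirrors and name in artifacts:
            artifacts[name]["preservation_mirrors"] = list(mirrors)

    # Release-level DOIs follow the same "empty means absent" rule.
    if preservation_dois:
        out["preservation_dois"] = list(preservation_dois)

    return out
```

Omitting the keys entirely when nothing is supplied means manifests produced without mirror metadata stay unchanged, so existing consumers need no changes.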
Why
2026-04-21 meeting with Lars Vilhuber (AEA Data Editor): HuggingFace doesn't publish a preservation commitment, so a TRO citation URL that resolves only through HF can 404 decades from now. Zenodo (CERN / OpenAIRE-operated, DOI-minting) is the reference preservation-grade host Lars pointed to. Fixes #810.
Depends on
Test plan
🤖 Generated with Claude Code