Make the distribution probe-to-RDF report an opt-in pipeline plugin (currently dead code) #463

Description

@ddeboer

Summary

The distribution probe-to-RDF reporting in @lde/pipeline is currently dead code: it generates schema:Action / schema:target / schema:error quads and is unit-tested, but is unreachable from the pipeline and not configurable, so no consumer can turn it on. It should become a first-class, opt-in capability (a pipeline plugin) rather than being left quietly disconnected.

Current state

packages/pipeline/src/distribution/report.ts exports probeResultsToQuads(), which converts probe results into schema:Action quads (schema:target per probed distribution, schema:error on failures). packages/pipeline/src/distribution/resolveDistributions.ts exports resolveDistributions(), a resolver wrapper that exposes those quads as a quads stream.

Neither is reachable from the pipeline:

  • Pipeline.processDataset (packages/pipeline/src/pipeline.ts) calls this.distributionResolver.resolve() directly and uses the result only to (a) select a distribution to analyse and (b) feed the progress reporter. It never calls resolveDistributions() or probeResultsToQuads(), so the quads are never handed to a writer and never reach the store.
  • A monorepo-wide search confirms the two symbols are referenced only by each other, the distribution/index.ts barrel export, and their tests — pipeline.ts has zero references.
  • The resolver already collects probeResults for every distribution; that data is currently discarded after reporting.
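
For readers unfamiliar with the model, the mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code in report.ts: the ProbeResult fields (url, error), the blank-node subjects, the https://schema.org/ namespace, and the function name sketchProbeResultsToQuads are all assumptions, and the link from each action to the dataset IRI is left out because the issue does not say how it is expressed.

```typescript
// Hypothetical, simplified RDF types; the real code presumably uses RDF/JS terms.
type Term = { termType: 'NamedNode' | 'BlankNode' | 'Literal'; value: string };
type Quad = { subject: Term; predicate: Term; object: Term };

// Assumed shape of a probe result; field names are illustrative only.
interface ProbeResult {
  url: string; // access URL of the probed distribution
  error?: string; // set only when the probe failed
}

const SCHEMA = 'https://schema.org/'; // assumed namespace
const RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type';

const namedNode = (value: string): Term => ({ termType: 'NamedNode', value });
const literal = (value: string): Term => ({ termType: 'Literal', value });

// One schema:Action per probe, schema:target for the distribution,
// and schema:error only when the probe failed.
function sketchProbeResultsToQuads(results: ProbeResult[]): Quad[] {
  const quads: Quad[] = [];
  results.forEach((result, index) => {
    const action: Term = { termType: 'BlankNode', value: `probe${index}` };
    quads.push({
      subject: action,
      predicate: namedNode(RDF_TYPE),
      object: namedNode(`${SCHEMA}Action`),
    });
    quads.push({
      subject: action,
      predicate: namedNode(`${SCHEMA}target`),
      object: namedNode(result.url),
    });
    if (result.error !== undefined) {
      quads.push({
        subject: action,
        predicate: namedNode(`${SCHEMA}error`),
        object: literal(result.error),
      });
    }
  });
  return quads;
}
```

The point of the sketch is only that the quads are a pure function of probeResults, which the resolver already has in hand; nothing in processDataset currently passes them on.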

This surfaced downstream: the Dataset Knowledge Graph stories page had four charts built on the schema:Action model, all of which broke after the QLever migration (fixed separately by re-sourcing from the Dataset Register).

Why opt-in, not removal

It is tempting to delete this, but that reasoning only holds for consumers like NDE’s Dataset Knowledge Graph, where a separate Dataset Register owns distribution validation authoritatively (per-registration schema:status + SHACL validity). For such a consumer, re-emitting schema:Action from the pipeline is a weaker, duplicate signal.

But @lde is published for other consumers too. A standalone consumer that selects datasets from somewhere other than an NDE Dataset Register, and has no separate validation/registry app, has no other record of which distributions were reachable and why others failed — for them the probe report is the primary distribution-provenance signal, not a duplicate.

So the capability should be available but off by default:

  • DKG-style consumers (with a register) leave it out — the register stays authoritative, no duplicate signal.
  • Standalone consumers add the plugin and get schema:Action / schema:error distribution provenance in their store.

Proposed change

  • Expose the probe report as an opt-in pipeline plugin (alongside provenancePlugin), rather than calling it unconditionally.
  • The existing PipelinePlugin hook is beforeStageWrite (a transform of the post-stage quad stream). The probe report instead needs to inject quads derived from the resolver’s probeResults, which is not a stream transform. So add a small lifecycle hook (e.g. afterResolve, or an additionalQuads source) that runs after distribution resolution and lets a plugin emit probeResultsToQuads(resolved.probeResults, dataset.iri) into the writer. This also makes the probeResults that the resolver already collects actually usable.
  • Keep SparqlDistributionResolver / ImportResolver unchanged (still used for distribution selection).
  • Document the plugin so consumers can decide whether they need it.
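
A rough sketch of what the proposed hook could look like, to make the shape of the change concrete. Everything here is hypothetical: the afterResolve signature, the probeReportPlugin factory, the collectAfterResolveQuads helper, the string-based Quad type, and the probe IRIs are illustrative stand-ins, and the real beforeStageWrite hook operates on a quad stream rather than an array.

```typescript
// Simplified stand-in types; not the actual @lde/pipeline definitions.
type Quad = { subject: string; predicate: string; object: string };
interface ProbeResult { url: string; error?: string }
interface ResolvedDistributions { probeResults: ProbeResult[] }
interface Dataset { iri: string }

interface PipelinePlugin {
  name: string;
  // Existing hook (simplified to arrays here): transforms post-stage quads.
  beforeStageWrite?(quads: Quad[]): Quad[];
  // Proposed hook: emits extra quads once distributions are resolved.
  afterResolve?(resolved: ResolvedDistributions, dataset: Dataset): Quad[];
}

// Stand-in for probeResultsToQuads(); the probe IRIs are made up for the sketch.
function probeQuads(results: ProbeResult[], datasetIri: string): Quad[] {
  const quads: Quad[] = [];
  results.forEach((result, index) => {
    const action = `${datasetIri}#probe-${index}`;
    quads.push({ subject: action, predicate: 'rdf:type', object: 'schema:Action' });
    quads.push({ subject: action, predicate: 'schema:target', object: result.url });
    if (result.error !== undefined) {
      quads.push({ subject: action, predicate: 'schema:error', object: result.error });
    }
  });
  return quads;
}

// Opt-in plugin: consumers that want the probe report add it explicitly.
function probeReportPlugin(): PipelinePlugin {
  return {
    name: 'probe-report',
    afterResolve: (resolved, dataset) => probeQuads(resolved.probeResults, dataset.iri),
  };
}

// What the pipeline would do right after distribution resolution:
// gather quads from every plugin that implements the new hook.
function collectAfterResolveQuads(
  plugins: PipelinePlugin[],
  resolved: ResolvedDistributions,
  dataset: Dataset,
): Quad[] {
  const out: Quad[] = [];
  for (const plugin of plugins) {
    const extra = plugin.afterResolve?.(resolved, dataset) ?? [];
    out.push(...extra);
  }
  return out;
}
```

With an empty plugin list the helper returns no quads, which matches the off-by-default behaviour: a register-backed consumer changes nothing, while a standalone consumer adds probeReportPlugin() to its plugin list and the returned quads go to the writer.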

Alternative (YAGNI)

If the maintainers prefer to keep the library lean and no consumer needs this today, remove probeResultsToQuads, resolveDistributions, their barrel exports, and tests, and re-add when a real consumer appears (git history preserves it). Given the code already exists, is tested, and there is a clean plugin seam, the opt-in route is low-cost and better serves the shared-library goal — but either beats the current unreachable-but-tested state.

Caveat to confirm

Verify that nothing outside this monorepo still consumes the schema:Action distribution triples. They have not been written since the QLever migration, so any remaining consumer is already broken, but it is worth a quick check before changing the public surface.
