Merged
43 changes: 15 additions & 28 deletions .vale/styles/config/vocabularies/DependencyTrack/accept.txt
@@ -8,8 +8,7 @@ Aiven
ApacheDS
Artifactory
Atlassian
BOM
BOMs
BOMs?
Bento
CPEs
CWEs
@@ -19,6 +18,7 @@ DEKs
DNs
Dependency-Track
Diataxis
ETags?
Entra
Exploitability
Fedora 389 Directory Server
@@ -42,11 +42,9 @@ Lucene
Mattermost
MkDocs
Modus
ModusCreate
Modus[Cc]reate
NVME
Nixpkgs
Nixpkgs
Novell
OAuth
OWASP
@@ -60,8 +58,7 @@ PgBouncer
Podman
Postgres
Protobuf
SBOM
SBOMs
SBOMs?
SSDs
Snyk
Sonatype
@@ -70,26 +67,21 @@ Timescale
Tink
Trivy
URIs
VDR
VDRs
VDRs?
VEX
VulnDB
Webex
[Aa]llowlist
[Aa]llowlists
[Aa]llowlists?
[Aa]utodetect
[Bb]ackoff
[Bb]ackpressure
[Bb]locklist
[Bb]locklists
[Bb]locklists?
[Bb]ooleans
[Cc]onfig
[Cc]onfigs
[Cc]onfigs?
[Cc]ron
[Cc]rontab
[Dd]atacenter
[Dd]atasource
[Dd]atasources
[Dd]atasources?
[Dd]eduplicat(e[ds]?|ing)
[Dd]eduplication
[Dd]efault[Mm]ode
@@ -99,8 +91,7 @@ Webex
[Dd]tapac
[Ee]num
[Ff]ileset
[Hh]ostname
[Hh]ostnames
[Hh]ostnames?
[Ii]dempoten(t|cy)
[Jj]etstack
[Kk]enna
@@ -111,11 +102,10 @@ Webex
[Mm]isconfiguration
[Mm]isconfigured
[Mm]ixeway
[Nn]amespace
[Nn]amespaced?
[Nn]amespaces
[Oo]utbox
[Pp]luggable
[Pp]ooler
[Pp]oolers?
[Pp]roxied
[Rr]eadiness
@@ -134,23 +124,20 @@ Webex
[Tt]yposquatting
[Uu]nmanaged
[Uu]nprefixed
[Uu]psert
[Uu]pserts
[Uu]pserts?
[Vv]alidators?
[Vv]ers
[Vv]ulns
apiserver
autovacuum
crypto
eDirectory
keyset
keysets
keysets?
keytool
npm
sAMAccountName
timestamptz
truststore
truststore
userinfo
walkthrough
walkthroughs
zstd
walkthroughs?
zstd
79 changes: 48 additions & 31 deletions docs/concepts/architecture/design/package-metadata-resolution.md
@@ -4,15 +4,14 @@

The package metadata resolution system retrieves metadata for packages from upstream
repositories. This includes latest available versions, artifact hashes, and publish timestamps.
The data is used for latest version checks, component age policies, and integrity verification.
Dependency-Track uses this data for latest version checks, component age policies, and integrity verification.

Resolution is orchestrated as a singleton [durable workflow](durable-execution.md) that
processes packages in controlled batches. The data model is documented in
ADR-015.
A singleton [durable workflow](durable-execution.md) orchestrates resolution and
processes packages in controlled batches. ADR-015 documents the data model.

## Responsible consumption of public infrastructure

Package registries like Maven Central, npm, and PyPI are shared public resources.
Public registries like Maven Central, npm, and PyPI share a finite pool of bandwidth across all consumers.
Sonatype has documented that [1% of IP addresses account for 83% of Maven Central's total bandwidth][maven-overconsumption],
and that registries have begun enforcing [organization-level throttling][maven-tragedy] in response,
returning HTTP 429 errors to excessive consumers.
@@ -21,11 +20,20 @@ For a system like Dependency-Track, which may track hundreds of thousands of com
projects, this has direct architectural implications. Making an HTTP request per component on every
BOM upload or scheduled analysis cycle is not viable. Neither is spawning hundreds of concurrent
requests against upstream registries. Both patterns would quickly exhaust rate limits, especially
in larger deployments where multiple API server instances run concurrently.
in larger deployments where many API server instances run concurrently.

The resolution system is therefore designed around scheduled, controlled processing rather than
ad-hoc lookups. Components that need metadata resolution are identified in batches, resolved
sequentially per resolver, and persisted with enough provenance to avoid redundant upstream requests.
The resolution system thus favors scheduled, controlled processing over
ad-hoc lookups. The system identifies components needing metadata resolution in batches, resolves
them sequentially per resolver, and persists results with enough provenance to avoid redundant upstream requests.

Resolvers also issue [conditional HTTP requests][rfc-conditional] when refreshing
data. A small shared cache stores each upstream response body together with its `ETag` and
`Last-Modified` validators. Refreshes after the in-process freshness window send `If-None-Match`
(or `If-Modified-Since`) and most often exchange a small `304 Not Modified` rather than a full
metadata document. This cuts upstream bandwidth without changing how often the system contacts the
registry, and on registries that exempt `304` responses from rate limiting (notably the GitHub API)
it also conserves quota. Sonatype's
[*Open is not costless: reclaiming sustainable infrastructure*][open-not-costless] post explains the motivation.
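The conditional-refresh flow described above can be sketched as follows. This is an illustrative sketch only; `CachedEntry`, `conditional_headers`, and `apply_response` are hypothetical names, not Dependency-Track APIs:

```python
# Illustrative sketch of conditional HTTP refreshes (RFC 7232 validators).
# All names here are hypothetical, not actual Dependency-Track code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CachedEntry:
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None

def conditional_headers(entry):
    """Headers for a refresh request; empty when the cache is cold."""
    headers = {}
    if entry and entry.etag:
        headers["If-None-Match"] = entry.etag
    elif entry and entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers

def apply_response(entry, status, body, etag=None, last_modified=None):
    """A 304 replays the cached body; a 200 replaces the entry."""
    if status == 304 and entry is not None:
        return entry  # upstream unchanged, no bandwidth spent on the body
    return CachedEntry(body=body, etag=etag, last_modified=last_modified)
```

The interesting property is the `304` branch: the registry is still contacted on every refresh, but the response carries no body, which is where the bandwidth saving comes from.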

## Data model

@@ -49,7 +57,7 @@ Refer to ADR-015 for the full rationale.

Both tables use `COALESCE`-based upserts that preserve existing non-null values.
A temporal guard (`WHERE "RESOLVED_AT" < EXCLUDED."RESOLVED_AT"`) prevents older results
from overwriting newer ones. Writes use PostgreSQL `UNNEST` to batch multiple rows
from overwriting newer ones. Writes use PostgreSQL `UNNEST` to batch many rows
per statement, reducing round trips.
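One plausible reading of that upsert semantics, sketched in Python rather than SQL. The field names are illustrative, and `COALESCE(new, old)` per column is assumed, so a null in the incoming row preserves the stored value:

```python
# Pure-Python sketch of the upsert semantics described above: per-column
# COALESCE plus a temporal guard on RESOLVED_AT. Field names are
# illustrative, not the actual schema.
def upsert(existing, incoming):
    if existing is None:
        return dict(incoming)
    if incoming["resolved_at"] <= existing["resolved_at"]:
        return existing  # temporal guard: never let older data win
    merged = {}
    for key, value in incoming.items():
        # COALESCE(new, old): a null in the new row keeps the old value
        merged[key] = value if value is not None else existing.get(key)
    return merged
```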

## Workflow
@@ -60,7 +68,7 @@ Package metadata resolution is a singleton [durable workflow](durable-execution.

The workflow uses a fixed instance ID (`resolve-package-metadata`). The durable execution engine enforces
that only a single execution of a given workflow instance in non-terminal state can exist at
any point in time. Attempts to create a run while one is already active are silently deduplicated.
any moment. Attempts to create a run while one is already active are silently deduplicated.
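A minimal sketch of that deduplication behavior, assuming an engine that keys runs by instance ID; the class, method, and state names are hypothetical:

```python
# Hypothetical sketch of singleton-workflow deduplication: a second create
# for the same instance ID is a silent no-op while a run is non-terminal.
ACTIVE_STATES = {"PENDING", "RUNNING"}

class WorkflowEngine:
    def __init__(self):
        self._runs = {}  # instance ID -> latest run state

    def create_run(self, instance_id):
        """Start a run unless one is already in a non-terminal state."""
        if self._runs.get(instance_id) in ACTIVE_STATES:
            return False  # silently deduplicated
        self._runs[instance_id] = "RUNNING"
        return True

    def complete(self, instance_id):
        self._runs[instance_id] = "COMPLETED"
```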

This guarantees that at most one resolution workflow is active across the entire cluster,
regardless of how many API server instances are running. Concurrent resolution attempts are
@@ -69,7 +77,7 @@ persistence.

### Triggers

The workflow is triggered in three situations:
Three situations trigger the workflow:

| Trigger | When |
|:----------------|:-------------------------------------------------------------------|
@@ -120,10 +128,10 @@ A PURL is eligible for resolution if:
* no `PACKAGE_METADATA` record exists for it, or
* the corresponding `PACKAGE_METADATA` was last resolved over 24 hours ago.

Candidates are fetched in batches of 250 and grouped by resolver. Each PURL is matched
to the first resolver whose `normalize` method returns a non-null result. PURLs with no
matching resolver are grouped under an empty resolver name. The resolve activity persists
empty results for these so they don't re-appear as candidates in subsequent batches.
The activity fetches candidates in batches of 250 and groups them by resolver. Each PURL maps
to the first resolver whose `normalize` method returns a non-null result. The activity groups PURLs with no
matching resolver under an empty resolver name. The resolve activity persists
empty results for these so they don't re-appear as candidates in later batches.
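The eligibility check and first-match grouping can be sketched as follows; the resolver tuples and PURL strings are illustrative:

```python
# Sketch of candidate selection: eligible if never resolved or resolved
# over 24 hours ago, then grouped under the first matching resolver
# (empty name when none matches). Names are illustrative.
from collections import defaultdict

DAY_SECONDS = 24 * 60 * 60

def is_candidate(resolved_at, now):
    return resolved_at is None or now - resolved_at > DAY_SECONDS

def group_by_resolver(purls, resolvers):
    """resolvers: list of (name, normalize) pairs, in match order."""
    groups = defaultdict(list)
    for purl in purls:
        for name, normalize in resolvers:
            normalized = normalize(purl)
            if normalized is not None:
                groups[name].append(normalized)
                break
        else:
            groups[""].append(purl)  # persisted later as an empty result
    return dict(groups)
```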

### Resolution

Expand All @@ -136,16 +144,16 @@ activities run concurrently.

For each PURL, the activity:

1. Checks if it was already resolved within the last 5 minutes (idempotency guard for retries).
1. Checks if the resolver already produced a result within the last 5 minutes (idempotency guard for retries).
2. Normalizes the PURL via the resolver factory.
3. Looks up configured repositories for the PURL type, ordered by resolution priority.
4. Iterates repositories, respecting internal/external classification, invoking the resolver
until one succeeds or all are exhausted.
until one succeeds or none remain.
5. Buffers results and flushes to the database in batches of 25.

If a resolver signals a retryable error (for example, HTTP 429), the activity flushes any buffered
results and propagates the error. The durable execution engine then retries the activity with backoff.
Non-retryable errors for individual PURLs are caught and an empty result is persisted,
The activity catches non-retryable errors for individual PURLs and persists an empty result,
preventing the PURL from becoming a candidate again immediately.
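The error-handling shape of that loop can be sketched with hypothetical names; the 25-row flush size and the flush-before-propagate behavior mirror the text:

```python
# Sketch of the per-resolver activity loop: buffer results, flush every 25,
# persist empty results for non-retryable failures, and flush buffered work
# before propagating a retryable error. Names are illustrative.
FLUSH_SIZE = 25

class RetryableError(Exception):
    pass

def resolve_batch(purls, resolve_one, flush):
    buffer = []
    try:
        for purl in purls:
            try:
                buffer.append(resolve_one(purl))
            except RetryableError:
                raise  # engine retries the activity with backoff
            except Exception:
                buffer.append((purl, None))  # empty result, no instant retry
            if len(buffer) >= FLUSH_SIZE:
                flush(buffer)
                buffer = []
    finally:
        if buffer:
            flush(buffer)  # completed work survives a retryable failure
```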

#### Retry policy
@@ -165,16 +173,16 @@ picks up the next batch. This prevents unbounded history growth: each run's even
covers only one batch. Without this, a single run resolving thousands of packages would
accumulate a large history that degrades replay performance.

It also creates a natural checkpoint. If the process is interrupted between batches,
It also creates a natural checkpoint. If the process stops between batches,
the next run starts with a fresh candidate query, skipping already-resolved PURLs
based on the `RESOLVED_AT` timestamps in the database.

The cycle continues until no more candidates are returned, at which point the workflow
The cycle continues until no candidates remain, at which point the workflow
completes normally.
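The batch-per-run cycle can be sketched as a loop in which every batch gets a fresh run; the engine API here is hypothetical, and 250 mirrors the batch size from the candidate query:

```python
# Sketch of the continue-as-new cycle: one batch per run, a fresh run (and
# fresh event history) per batch, completion when no candidates remain.
BATCH_SIZE = 250

def run_resolution(fetch_candidates, process):
    runs = 0
    while True:
        batch = fetch_candidates(BATCH_SIZE)
        if not batch:
            return runs  # workflow completes normally
        process(batch)
        runs += 1  # continue-as-new: next batch starts a new run
```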

## Concurrency control

Concurrency is controlled at multiple levels:
Concurrency control operates at four levels:

| Level | Mechanism | Effect |
|:---------|:----------------------------------------------|:---------------------------------------------------|
@@ -190,8 +198,8 @@ concurrency interact.

Resolvers are pluggable via the plugin system. The API surface consists of two interfaces:

* **`PackageMetadataResolverFactory`**: Creates resolver instances. Declares whether a repository
is required, normalizes PURLs (returning `null` to signal non-support), and exposes the
* **`PackageMetadataResolverFactory`**: Creates resolver instances. Declares whether the resolver
needs a repository, normalizes PURLs (returning `null` to signal non-support), and exposes the
extension name.
* **`PackageMetadataResolver`**: Given a normalized PURL and an optional repository,
returns `PackageMetadata` or `null`.
@@ -200,9 +208,16 @@ A single resolver handles a given PURL (first match wins based on factory orderi
Within that resolver, a single repository provides the result (first success wins based
on the configured resolution order).
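First-match-wins at both levels can be sketched as follows. `normalize` mirrors the factory interface described above, while `factory.resolver(...)` is an illustrative stand-in for creating and invoking the resolver, not the real API:

```python
# Sketch of the two-level selection: the first factory that normalizes the
# PURL owns it; within it, the first repository (in configured priority
# order) that yields metadata wins. Names are illustrative.
def resolve(purl, factories, repositories):
    for factory in factories:
        normalized = factory.normalize(purl)
        if normalized is None:
            continue  # factory does not support this PURL type
        for repo in repositories:  # pre-sorted by resolution priority
            metadata = factory.resolver(repo, normalized)
            if metadata is not None:
                return metadata
        return None  # owning resolver found nothing in any repository
    return None  # no factory supports the PURL
```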

Resolvers own their caching strategy. The candidate query marks a PURL as eligible
after 24 hours, but a resolver can decide to serve from its own cache and skip the
upstream request entirely.
Resolvers route their upstream fetches through a shared HTTP-resource cache that
handles ETag and Last-Modified revalidation transparently. While an entry stays fresh the
cache serves the body without contacting the upstream; once stale but still cached,
the cache sends validators and a `304 Not Modified` replays the body, while a `200 OK` replaces
the entry. The cache negatively caches `404` and `410` responses so absent packages do not re-issue
requests within the freshness window. The cache uses a separate namespace per resolver and is
pluggable between in-memory and database providers via the existing cache provider mechanism.

The candidate query marks a PURL as eligible after 24 hours, but the resolver-side cache
typically prevents that from translating to a full upstream transfer.
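The cache's lookup decision can be sketched as a small state function; the 300-second freshness window is an assumed value, and the handling of stale negative entries is illustrative:

```python
# Sketch of the cache decision described above: fresh entries are served
# locally, negative entries (404/410) report absence within the freshness
# window, and stale 200 entries trigger a conditional revalidation.
from dataclasses import dataclass

FRESHNESS_WINDOW = 300.0  # seconds; assumed value, not the real setting

@dataclass
class Entry:
    fetched_at: float
    status: int        # 200, or a negatively cached 404/410
    body: bytes = b""

def plan_lookup(entry, now):
    if entry is None:
        return "fetch"          # cold cache: unconditional request
    if now - entry.fetched_at <= FRESHNESS_WINDOW:
        # fresh: serve the body, or report absence for a negative entry
        return "serve" if entry.status == 200 else "absent"
    if entry.status in (404, 410):
        return "fetch"          # stale negative entry: try upstream again
    return "revalidate"         # stale 200: send validators, hope for 304
```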

## Maintenance

@@ -211,7 +226,7 @@ A scheduled task cleans up orphaned metadata:
1. Deletes `PACKAGE_ARTIFACT_METADATA` rows where no `COMPONENT` with a matching PURL exists.
2. Deletes `PACKAGE_METADATA` rows where no `PACKAGE_ARTIFACT_METADATA` references them.

The two-step cascade prevents unbounded table growth as components are removed from the portfolio.
The two-step cascade prevents unbounded table growth as components leave the portfolio.
Distributed locking ensures only one node executes cleanup at a time.
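The two-step cascade can be sketched over in-memory sets in place of the two tables; the IDs and record shapes are illustrative:

```python
# Sketch of the two-step orphan cleanup. Step 1 removes artifact metadata
# whose PURL no longer matches any component; step 2 removes package
# metadata no longer referenced by any artifact metadata.
def cleanup(component_purls, artifact_meta, package_meta):
    """artifact_meta maps artifact ID -> (purl, package-metadata ID)."""
    orphans = [aid for aid, (purl, _) in artifact_meta.items()
               if purl not in component_purls]
    for aid in orphans:
        del artifact_meta[aid]
    referenced = {pid for _, pid in artifact_meta.values()}
    package_meta.intersection_update(referenced)
```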

## Resiliency
@@ -221,12 +236,14 @@ transparently:

* If a node crashes mid-resolution, the workflow resumes from the last completed step on restart.
Results already flushed to the database are not re-processed, because the 5-minute idempotency
window in the activity causes recently resolved PURLs to be skipped.
window in the activity causes the activity to skip recently resolved PURLs.
* When resolution fails for a specific resolver even after exhausting retries, the workflow
catches the `ActivityFailureException`, logs the failure, and continues. Results from
other resolvers are persisted normally.
catches the `ActivityFailureException`, logs the failure, and continues. The workflow persists
results from other resolvers normally.
* On graceful shutdown, the activity checks for thread interruption before each PURL,
flushes buffered results, and propagates the interruption.

[maven-overconsumption]: https://www.sonatype.com/blog/beyond-ips-addressing-organizational-overconsumption-in-maven-central
[maven-tragedy]: https://www.sonatype.com/blog/maven-central-and-the-tragedy-of-the-commons
[open-not-costless]: https://www.sonatype.com/blog/open-is-not-costless-reclaiming-sustainable-infrastructure
[rfc-conditional]: https://www.rfc-editor.org/rfc/rfc7232