diff --git a/.vale/styles/config/vocabularies/DependencyTrack/accept.txt b/.vale/styles/config/vocabularies/DependencyTrack/accept.txt index d4d0ec5..a830281 100644 --- a/.vale/styles/config/vocabularies/DependencyTrack/accept.txt +++ b/.vale/styles/config/vocabularies/DependencyTrack/accept.txt @@ -8,8 +8,7 @@ Aiven ApacheDS Artifactory Atlassian -BOM -BOMs +BOMs? Bento CPEs CWEs @@ -19,6 +18,7 @@ DEKs DNs Dependency-Track Diataxis +ETags? Entra Exploitability Fedora 389 Directory Server @@ -42,11 +42,9 @@ Lucene Mattermost MkDocs Modus -ModusCreate Modus[Cc]reate NVME Nixpkgs -Nixpkgs Novell OAuth OWASP @@ -60,8 +58,7 @@ PgBouncer Podman Postgres Protobuf -SBOM -SBOMs +SBOMs? SSDs Snyk Sonatype @@ -70,26 +67,21 @@ Timescale Tink Trivy URIs -VDR -VDRs +VDRs? VEX VulnDB Webex -[Aa]llowlist -[Aa]llowlists +[Aa]llowlists? [Aa]utodetect [Bb]ackoff [Bb]ackpressure -[Bb]locklist -[Bb]locklists +[Bb]locklists? [Bb]ooleans -[Cc]onfig -[Cc]onfigs +[Cc]onfigs? [Cc]ron [Cc]rontab [Dd]atacenter -[Dd]atasource -[Dd]atasources +[Dd]atasources? [Dd]eduplicat(e[ds]?|ing) [Dd]eduplication [Dd]efault[Mm]ode @@ -99,8 +91,7 @@ Webex [Dd]tapac [Ee]num [Ff]ileset -[Hh]ostname -[Hh]ostnames +[Hh]ostnames? [Ii]dempoten(t|cy) [Jj]etstack [Kk]enna @@ -111,11 +102,10 @@ Webex [Mm]isconfiguration [Mm]isconfigured [Mm]ixeway -[Nn]amespace +[Nn]amespaced? [Nn]amespaces [Oo]utbox [Pp]luggable -[Pp]ooler [Pp]oolers? [Pp]roxied [Rr]eadiness @@ -134,23 +124,20 @@ Webex [Tt]yposquatting [Uu]nmanaged [Uu]nprefixed -[Uu]psert -[Uu]pserts +[Uu]pserts? +[Vv]alidators? [Vv]ers [Vv]ulns apiserver autovacuum crypto eDirectory -keyset -keysets +keysets? keytool npm sAMAccountName timestamptz truststore -truststore userinfo -walkthrough -walkthroughs -zstd +walkthroughs? 
+zstd \ No newline at end of file diff --git a/docs/concepts/architecture/design/package-metadata-resolution.md b/docs/concepts/architecture/design/package-metadata-resolution.md index dab0eb2..a6bee36 100644 --- a/docs/concepts/architecture/design/package-metadata-resolution.md +++ b/docs/concepts/architecture/design/package-metadata-resolution.md @@ -4,15 +4,14 @@ The package metadata resolution system retrieves metadata for packages from upstream repositories. This includes latest available versions, artifact hashes, and publish timestamps. -The data is used for latest version checks, component age policies, and integrity verification. +Dependency-Track uses this data for latest version checks, component age policies, and integrity verification. -Resolution is orchestrated as a singleton [durable workflow](durable-execution.md) that -processes packages in controlled batches. The data model is documented in -ADR-015. +A singleton [durable workflow](durable-execution.md) orchestrates resolution and +processes packages in controlled batches. ADR-015 documents the data model. ## Responsible consumption of public infrastructure -Package registries like Maven Central, npm, and PyPI are shared public resources. +Public registries like Maven Central, npm, and PyPI share a finite pool of bandwidth across all consumers. Sonatype has documented that [1% of IP addresses account for 83% of Maven Central's total bandwidth][maven-overconsumption], and that registries have begun enforcing [organization-level throttling][maven-tragedy] in response, returning HTTP 429 errors to excessive consumers. @@ -21,11 +20,20 @@ For a system like Dependency-Track, which may track hundreds of thousands of com projects, this has direct architectural implications. Making an HTTP request per component on every BOM upload or scheduled analysis cycle is not viable. Neither is spawning hundreds of concurrent requests against upstream registries. 
Both patterns would quickly exhaust rate limits, especially -in larger deployments where multiple API server instances run concurrently. +in larger deployments where many API server instances run concurrently. -The resolution system is therefore designed around scheduled, controlled processing rather than -ad-hoc lookups. Components that need metadata resolution are identified in batches, resolved -sequentially per resolver, and persisted with enough provenance to avoid redundant upstream requests. +The resolution system thus favors scheduled, controlled processing over +ad-hoc lookups. The system identifies components needing metadata resolution in batches, resolves +them sequentially per resolver, and persists results with enough provenance to avoid redundant upstream requests. + +Resolvers also issue [conditional HTTP requests][rfc-conditional] when refreshing +data. A small shared cache stores each upstream response body together with its `ETag` and +`Last-Modified` validators. Refreshes after the in-process freshness window send `If-None-Match` +(or `If-Modified-Since`) and most often exchange a small `304 Not Modified` rather than a full +metadata document. This cuts upstream bandwidth without changing how often the system contacts the +registry, and on registries that exempt `304` responses from rate limiting (notably the GitHub API) +it also conserves quota. Sonatype's +[*Open is not costless: reclaiming sustainable infrastructure*][open-not-costless] post explains the motivation. ## Data model @@ -49,7 +57,7 @@ Refer to ADR-015 for the full rationale. Both tables use `COALESCE`-based upserts that preserve existing non-null values. A temporal guard (`WHERE "RESOLVED_AT" < EXCLUDED."RESOLVED_AT"`) prevents older results -from overwriting newer ones. Writes use PostgreSQL `UNNEST` to batch multiple rows +from overwriting newer ones. Writes use PostgreSQL `UNNEST` to batch many rows per statement, reducing round trips. 
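The upsert semantics described above can be modelled in a few lines. This is a minimal in-memory sketch, not the actual SQL: the column names other than `RESOLVED_AT` are hypothetical, and it assumes the `COALESCE` order lets a non-null incoming value win while an incoming `NULL` never erases a stored value.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class MetadataRow:
    # Hypothetical subset of columns; only RESOLVED_AT is named in the text.
    purl: str
    latest_version: Optional[str]
    published_at: Optional[datetime]
    resolved_at: datetime


def coalesce(*values):
    """Return the first non-None argument, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)


def upsert(table: Dict[str, MetadataRow], incoming: MetadataRow) -> None:
    existing = table.get(incoming.purl)
    if existing is None:
        table[incoming.purl] = incoming  # no conflict: plain insert
        return
    # Temporal guard: only update when the stored row is strictly older.
    if not existing.resolved_at < incoming.resolved_at:
        return
    table[incoming.purl] = MetadataRow(
        purl=incoming.purl,
        # Assumed COALESCE order: incoming value first, stored value as fallback.
        latest_version=coalesce(incoming.latest_version, existing.latest_version),
        published_at=coalesce(incoming.published_at, existing.published_at),
        resolved_at=incoming.resolved_at,
    )


def upsert_batch(table: Dict[str, MetadataRow], rows: Iterable[MetadataRow]) -> None:
    # Stands in for a single UNNEST-based statement carrying many rows.
    for row in rows:
        upsert(table, row)
```

With this model, a newer result that lacks a version keeps the stored version but advances the timestamp, and a late-arriving older result is ignored entirely.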
## Workflow @@ -60,7 +68,7 @@ Package metadata resolution is a singleton [durable workflow](durable-execution. The workflow uses a fixed instance ID (`resolve-package-metadata`). The durable execution engine enforces that only a single execution of a given workflow instance in non-terminal state can exist at -any point in time. Attempts to create a run while one is already active are silently deduplicated. +any moment. Attempts to create a run while one is already active are silently deduplicated. This guarantees that at most one resolution workflow is active across the entire cluster, regardless of how many API server instances are running. Concurrent resolution attempts are @@ -69,7 +77,7 @@ persistence. ### Triggers -The workflow is triggered in three situations: +Three situations trigger the workflow: | Trigger | When | |:----------------|:-------------------------------------------------------------------| @@ -120,10 +128,10 @@ A PURL is eligible for resolution if: * no `PACKAGE_METADATA` record exists for it, or * the corresponding `PACKAGE_METADATA` was last resolved over 24 hours ago. -Candidates are fetched in batches of 250 and grouped by resolver. Each PURL is matched -to the first resolver whose `normalize` method returns a non-null result. PURLs with no -matching resolver are grouped under an empty resolver name. The resolve activity persists -empty results for these so they don't re-appear as candidates in subsequent batches. +The activity fetches candidates in batches of 250 and groups them by resolver. Each PURL maps +to the first resolver whose `normalize` method returns a non-null result. The activity groups PURLs with no +matching resolver under an empty resolver name. The resolve activity persists +empty results for these so they don't re-appear as candidates in later batches. ### Resolution @@ -136,16 +144,16 @@ activities run concurrently. For each PURL, the activity: -1. 
Checks if it was already resolved within the last 5 minutes (idempotency guard for retries). +1. Checks if the resolver already produced a result within the last 5 minutes (idempotency guard for retries). 2. Normalizes the PURL via the resolver factory. 3. Looks up configured repositories for the PURL type, ordered by resolution priority. 4. Iterates repositories, respecting internal/external classification, invoking the resolver - until one succeeds or all are exhausted. + until one succeeds or none remain. 5. Buffers results and flushes to the database in batches of 25. If a resolver signals a retryable error (for example, HTTP 429), the activity flushes any buffered results and propagates the error. The durable execution engine then retries the activity with backoff. -Non-retryable errors for individual PURLs are caught and an empty result is persisted, +The activity catches non-retryable errors for individual PURLs and persists an empty result, preventing the PURL from becoming a candidate again immediately. #### Retry policy @@ -165,16 +173,16 @@ picks up the next batch. This prevents unbounded history growth: each run's even covers only one batch. Without this, a single run resolving thousands of packages would accumulate a large history that degrades replay performance. -It also creates a natural checkpoint. If the process is interrupted between batches, +It also creates a natural checkpoint. If the process stops between batches, the next run starts with a fresh candidate query, skipping already-resolved PURLs based on the `RESOLVED_AT` timestamps in the database. -The cycle continues until no more candidates are returned, at which point the workflow +The cycle continues until no candidates remain, at which point the workflow completes normally. 
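The batch-per-run cycle can be sketched as follows. This is an illustrative model under stated assumptions: `store` is a hypothetical mapping from PURL to its last `RESOLVED_AT` (or `None`), `resolve` is a hypothetical callable standing in for the resolve activity, and a plain loop stands in for the engine's continue-as-new mechanism.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional

BATCH_SIZE = 250                      # candidates fetched per run
REFRESH_AFTER = timedelta(hours=24)   # eligibility threshold


def eligible(resolved_at: Optional[datetime], now: datetime) -> bool:
    """A PURL qualifies if it has no metadata or was resolved over 24 hours ago."""
    return resolved_at is None or now - resolved_at > REFRESH_AFTER


def fetch_candidates(store: Dict[str, Optional[datetime]], now: datetime) -> List[str]:
    return [purl for purl, ts in store.items() if eligible(ts, now)][:BATCH_SIZE]


def run_once(store: Dict[str, Optional[datetime]],
             resolve: Callable[[str], None],
             now: datetime) -> bool:
    """One workflow run: handle a single batch. True means 'continue as new'."""
    batch = fetch_candidates(store, now)
    if not batch:
        return False  # no candidates remain: the workflow completes
    for purl in batch:
        resolve(purl)
        # Persisting a result (even an empty one) updates RESOLVED_AT,
        # so the PURL drops out of the next candidate query.
        store[purl] = now
    return True


def run_until_done(store: Dict[str, Optional[datetime]],
                   resolve: Callable[[str], None],
                   now: datetime) -> int:
    """Count how many runs processed a batch before the cycle ended."""
    runs = 0
    while run_once(store, resolve, now):
        runs += 1
    return runs
```

Because every run starts from a fresh candidate query, an interruption between runs loses no work: already-resolved PURLs are simply no longer eligible.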
## Concurrency control -Concurrency is controlled at multiple levels: +Concurrency control operates at four levels: | Level | Mechanism | Effect | |:---------|:----------------------------------------------|:---------------------------------------------------| @@ -190,8 +198,8 @@ concurrency interact. Resolvers are pluggable via the plugin system. The API surface consists of two interfaces: -* **`PackageMetadataResolverFactory`**: Creates resolver instances. Declares whether a repository - is required, normalizes PURLs (returning `null` to signal non-support), and exposes the +* **`PackageMetadataResolverFactory`**: Creates resolver instances. Declares whether the resolver + needs a repository, normalizes PURLs (returning `null` to signal non-support), and exposes the extension name. * **`PackageMetadataResolver`**: Given a normalized PURL and an optional repository, returns `PackageMetadata` or `null`. @@ -200,9 +208,16 @@ A single resolver handles a given PURL (first match wins based on factory orderi Within that resolver, a single repository provides the result (first success wins based on the configured resolution order). -Resolvers own their caching strategy. The candidate query marks a PURL as eligible -after 24 hours, but a resolver can decide to serve from its own cache and skip the -upstream request entirely. +Resolvers route their upstream fetches through a shared HTTP-resource cache that +handles ETag and Last-Modified revalidation transparently. While an entry stays fresh the +cache serves the body without contacting the upstream; once stale but still cached, +the cache sends validators and a `304 Not Modified` replays the body, while a `200 OK` replaces +the entry. The cache negatively caches `404` and `410` responses so absent packages do not re-issue +requests within the freshness window. The cache uses a separate namespace per resolver and is pluggable +between in-memory and database providers via the existing cache provider mechanism. 
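A minimal sketch of the revalidation behaviour described above, assuming a hypothetical `fetch` transport and an in-memory dictionary as the store. The class and field names are illustrative and are not taken from the Dependency-Track code base.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

NEGATIVE_STATUSES = {404, 410}  # cached as "absent" within the freshness window


@dataclass
class Entry:
    status: int
    body: Optional[bytes]
    etag: Optional[str]
    last_modified: Optional[str]
    fetched_at: float


# Hypothetical transport: (url, request_headers) -> (status, response_headers, body)
Fetch = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], Optional[bytes]]]


class RevalidatingCache:
    def __init__(self, fetch: Fetch, freshness_seconds: float):
        self._fetch = fetch
        self._freshness = freshness_seconds
        self._entries: Dict[str, Entry] = {}

    def get(self, url: str, now: float) -> Optional[bytes]:
        entry = self._entries.get(url)
        if entry and now - entry.fetched_at < self._freshness:
            return entry.body  # fresh (positive or negative): no upstream contact

        headers: Dict[str, str] = {}
        if entry and entry.status == 200:
            # Stale but cached: send a validator instead of a plain GET.
            if entry.etag:
                headers["If-None-Match"] = entry.etag
            elif entry.last_modified:
                headers["If-Modified-Since"] = entry.last_modified

        status, resp_headers, body = self._fetch(url, headers)
        if status == 304 and entry:
            entry.fetched_at = now  # replay the stored body, restart freshness
            return entry.body
        if status == 200:
            # Header lookup assumes exact-case keys from the hypothetical transport.
            self._entries[url] = Entry(200, body, resp_headers.get("ETag"),
                                       resp_headers.get("Last-Modified"), now)
            return body
        if status in NEGATIVE_STATUSES:
            self._entries[url] = Entry(status, None, None, None, now)
            return None
        raise RuntimeError(f"unexpected status {status} for {url}")
```

Passing `now` explicitly keeps the sketch deterministic; a real cache would read a clock and would likely need eviction and per-resolver namespacing, both of which this sketch omits.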
+ +The candidate query marks a PURL as eligible after 24 hours, but the resolver-side cache +typically prevents that from translating to a full upstream transfer. ## Maintenance @@ -211,7 +226,7 @@ A scheduled task cleans up orphaned metadata: 1. Deletes `PACKAGE_ARTIFACT_METADATA` rows where no `COMPONENT` with a matching PURL exists. 2. Deletes `PACKAGE_METADATA` rows where no `PACKAGE_ARTIFACT_METADATA` references them. -The two-step cascade prevents unbounded table growth as components are removed from the portfolio. +The two-step cascade prevents unbounded table growth as components leave the portfolio. Distributed locking ensures only one node executes cleanup at a time. ## Resiliency @@ -221,12 +236,14 @@ transparently: * If a node crashes mid-resolution, the workflow resumes from the last completed step on restart. Results already flushed to the database are not re-processed, because the 5-minute idempotency - window in the activity causes recently resolved PURLs to be skipped. + window in the activity causes the activity to skip recently resolved PURLs. * When resolution fails for a specific resolver even after exhausting retries, the workflow - catches the `ActivityFailureException`, logs the failure, and continues. Results from - other resolvers are persisted normally. + catches the `ActivityFailureException`, logs the failure, and continues. The workflow persists + results from other resolvers normally. * On graceful shutdown, the activity checks for thread interruption before each PURL, flushes buffered results, and propagates the interruption. [maven-overconsumption]: https://www.sonatype.com/blog/beyond-ips-addressing-organizational-overconsumption-in-maven-central [maven-tragedy]: https://www.sonatype.com/blog/maven-central-and-the-tragedy-of-the-commons +[open-not-costless]: https://www.sonatype.com/blog/open-is-not-costless-reclaiming-sustainable-infrastructure +[rfc-conditional]: https://www.rfc-editor.org/rfc/rfc7232