feat(request-cost): publish aggregate token telemetry to external endpoint#35749
Conversation
…point Add a scheduled push of the per-window + lifetime request-cost (token) counters to an external REST collector. The work runs on the existing monitor scheduler tick and is dispatched to DotConcurrentFactory's submitter so HTTP latency never blocks the monitor thread. - RequestCostSnapshot: immutable JSON payload (clusterId, environmentId, timestamp, window + lifetime request/token counters + averages). - RequestCostPublisher: @ApplicationScoped CDI bean. Activates implicitly when REQUEST_COST_PUSH_URL and REQUEST_COST_PUSH_TOKEN are both set. POSTs via CircuitBreakerUrl with bearer auth; transport failures are warnEvery-rate limited and the snapshot is dropped (observational telemetry). - RequestCostApiImpl.logRequestCost: builds a snapshot from the same counters it already logs, calls publisher on each tick. - Tests: lock the on-the-wire JSON shape (10 fields exact) and the enable gate. Cluster ID via ClusterFactory.getClusterId(); environment ID via ConfigUtils.getServerId(). Both wrapped in Try with "unknown" fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Claude finished @wezell's task in 2m 49s —— View job PR ReviewRead all five files plus [🟠 High] [🟡 Medium] Submitter rejection isn't handled in [🟡 Medium]
[🟡 Medium] final URI safe = new URI(uri.getScheme(), null, uri.getHost(), uri.getPort(),
uri.getPath(), null, null);[🟡 Medium] [🟢 Low / Nit] A few smaller items
Verdict |
- Align javadoc with the 10-minute fail-log throttle (was stated as "once per minute" but the constant is 600_000ms). - Change FAIL_LOG_INTERVAL_MS from long to int and drop the (int) casts — the value fits in int and Logger.warnEvery takes int, so the cast was theoretical overflow waiting to happen. - Sanitize the push URL before logging: strip RFC-3986 userinfo so a misconfigured REQUEST_COST_PUSH_URL=https://user:pass@host/... doesn't leak credentials into every failure log line. - Add tests for the sanitizer (with/without userinfo) and a regression test that Config.setProperty(key, null) actually unsets the property (so the enable gate can't silently flip on if Apache Commons ever stringifies null). 11/11 tests passing locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y, deterministic timestamps, header injection guard Addresses the second Claude AI review of PR #35749. No-blocker review, but the behavioral and operability items are worth landing: - Continuous time series. The skipZeroRequests gate now only suppresses the log line on consecutive idle windows; the publisher still emits a zero-snapshot every tick when enabled. This makes "idle cluster" and "agent down" graphable side-by-side on the collector. - Deterministic timestamp. Truncate to seconds (Instant.now().truncatedTo(SECONDS)) so the wire format is stable ISO-8601 seconds, matching the PR description and what most TSDBs index on. - Header injection guard. New sanitizeHeaderValue() strips CR/LF + trims before the bearer token hits the Authorization header. Tiny defense-in-depth — only an operator can set the token, but CRLF in a misconfigured value would otherwise be HTTP header injection. - Broaden @after cleanup to include REQUEST_COST_PUSH_TIMEOUT_MS so future tests don't bleed state. Tests: 12/12 passing locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…il log Third round of Claude AI review on PR #35749. The one merge-worthy item: - Gate disagreement bug. isEnabled() checked the raw token via isSet(), but post() sanitized (strip CRLF + trim) before putting it in the Authorization header. A token of " \r\n " would pass the gate and result in an *unauthenticated* POST to the collector — strictly worse than staying off. Move sanitization into the gate so this can't happen, and add a regression test for whitespace/CRLF-only tokens. - Clearer failure log. CircuitBreakerUrl can return without setting a response code (circuit open, transport error before response). Check isProcessed() first so we emit "did not complete" rather than "returned HTTP -1". 13/13 tests passing locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…questCost Third Claude review flagged that the four LongAdder reads (window then lifetime) aren't atomic relative to each other, so Σ(window) can briefly trail lifetime. That's intentional — atomic would need a lock and the data is observational. Document so the next reader doesn't "fix" it. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… keys Fourth Claude review on PR #35749. Two defense-in-depth items, no behavior contract changes: - Warn-once if REQUEST_COST_PUSH_URL uses http:// (not https://). The bearer token would otherwise traverse the wire in cleartext. Warn-rather-than-refuse so a misconfiguration doesn't silently drop telemetry — the operator sees the log line and fixes it. - Split Logger.warnEvery dedup keys into REQUEST_COST_PUSH_FAIL_TRANSPORT, REQUEST_COST_PUSH_FAIL_HTTP, REQUEST_COST_PUSH_ERR_EXCEPTION. Previously a 500 storm and a circuit trip shared one key, so the second mode was silently suppressed for the rest of the 10-minute window. Tests: 13/13 still passing locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fifth Claude review on PR #35749 (and the user's explicit call) flagged that the wire field environmentId was misleading: it was sourced from ConfigUtils.getServerId(), which is a per-JVM/per-node identifier — every node in a cluster sends a different value. Across Grafana that reads inverted. Rename to serverId so the wire contract matches the source data exactly: - clusterId = cluster - serverId = node Done before any collector consumes the payload, so the wire format is locked in correctly from the first emit. Tests: 13/13 still passing locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-check token on post Sixth Claude review on PR #35749. No blocker, but three real defensive fixes: - Clamp REQUEST_COST_DENOMINATOR to >= 1.0 at init. A misconfigured 0 would produce Infinity/NaN in the snapshot, which Jackson serializes as literals invalid under strict JSON parsers — collector-side breakage on the first emit. - Drop the explicit "Content-Type: application/json" header. CircuitBreakerUrl already sniffs rawData starting with '{' and applies the JSON content type itself; setting both can emit a duplicate header on some Apache HttpClient versions. - Re-check the sanitized token in post() as well as the URL. The config can be cleared between snapshot submission and the executor running the POST. Posting without an Authorization header is a worse failure mode than not posting — the collector sees authenticated-looking traffic that isn't. 13/13 tests passing locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…server ids Seventh Claude review on PR #35749. Two real defensive fixes, then I'm done chasing AI feedback — verdict has been "no blocker" since round 3 and the remaining items are diminishing-returns. - Snapshot Jackson @JsonAutoDetect: ANY -> PUBLIC_ONLY. All wire fields are already public final so behavior is identical, but the explicit lock makes it impossible to accidentally leak a future private field onto the wire. The "exactly 10 fields" test becomes belt-and-suspenders rather than the only line of defense. - nullSafe() coalesces null returns from ClusterFactory.getClusterId() / ConfigUtils.getServerId(). Try.getOrElse only fires on throw — null returns during early startup would otherwise have serialized "clusterId": null on the wire instead of the documented "unknown" fallback. 13/13 tests passing locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
So the multi-agent Claude backend reviewer fires on PRs I author and can submit a formal --approve / --request-changes (which the orchestrator's automatic-review path can't do). Touches .github/ which is owned by @dotCMS/dotdevelopers per CODEOWNERS — splitting from the feature commits so it's an obvious one-line audit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…owlist" This reverts commit aed6477.
🔍 dotCMS Backend Review[🟡 Medium]
if (scheme != null && scheme.equalsIgnoreCase("http")
&& httpSchemeWarned.compareAndSet(false, true)) {
Logger.warn(this.getClass(), "...bearer token will be sent in cleartext...");
}💡 Refuse to post when scheme is [🟡 Medium]
static String sanitizeHeaderValue(final String value) {
return value == null ? null : value.replace("\r", "").replace("\n", "").trim();
}💡 Strip the whole CTL range: return value == null ? null : value.replaceAll("[\\x00-\\x1F\\x7F]", "").trim();[🟡 Medium]
final URI safe = new URI(
uri.getScheme(), null, uri.getHost(), uri.getPort(),
uri.getPath(), uri.getQuery(), uri.getFragment());
return safe.toString();💡 Drop the query and fragment too — only scheme/host/port/path are needed for log disambiguation: final URI safe = new URI(uri.getScheme(), null, uri.getHost(), uri.getPort(),
uri.getPath(), null, null);Next steps
|
There was a problem hiding this comment.
Pull request overview
Adds opt-in, scheduled publishing of aggregated request-cost (“token”) telemetry to an external HTTP collector, reusing the existing RequestCostApiImpl monitor tick so cluster-wide usage can be tracked outside local logs.
Changes:
- Introduces
RequestCostSnapshotas a fixed-shape JSON payload for per-window and lifetime token aggregates. - Adds
RequestCostPublisherto POST snapshots asynchronously viaDotConcurrentFactoryandCircuitBreakerUrl, gated byREQUEST_COST_PUSH_URL+REQUEST_COST_PUSH_TOKEN. - Extends
RequestCostApiImpl#logRequestCost()to build/publish snapshots each tick while keeping console log throttling for consecutive idle windows; adds unit tests locking payload shape and enablement behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| dotCMS/src/main/java/com/dotcms/cost/RequestCostSnapshot.java | Defines the on-the-wire snapshot payload and locks JSON field visibility. |
| dotCMS/src/main/java/com/dotcms/cost/RequestCostPublisher.java | Implements async POST publishing with config gating and basic hardening/sanitization. |
| dotCMS/src/main/java/com/dotcms/cost/RequestCostApiImpl.java | Builds snapshot from existing counters and triggers publisher from the existing scheduler tick. |
| dotCMS/src/test/java/com/dotcms/cost/RequestCostSnapshotTest.java | Verifies snapshot JSON contains the expected fields/values and exactly 10 fields. |
| dotCMS/src/test/java/com/dotcms/cost/RequestCostPublisherTest.java | Verifies enablement gate, disabled no-op, and sanitization helpers. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Fixes #35750
Summary
RequestCostApiImplmonitor scheduler tick (default 60s, controlled byREQUEST_COST_TIME_WINDOW_SECONDS) — no new threads, no new schedulers.DotConcurrentFactory.getInstance().getSubmitter()so a slow or dead collector never blocks the monitor scheduler.REQUEST_COST_PUSH_URLandREQUEST_COST_PUSH_TOKENare set.Why
RequestCostApiImplalready aggregates request counts and token totals (window + lifetime). Today those numbers are only emitted to the local log — they aren't queryable across a cluster. This wires the same aggregates into an external time-series sink so per-cluster, per-environment token consumption can be tracked centrally.Wire format
One snapshot per tick (10 fields, exact):
{ "clusterId": "...", "serverId": "...", "timestamp": "2026-05-19T18:48:00Z" // ISO-8601, truncated to seconds, "windowSeconds": 60, "windowRequests": 1234, "windowTokens": 5678.5, "windowAvgTokensPerRequest": 4.6, "lifetimeRequests": 999999, "lifetimeTokens": 12345678.25, "lifetimeAvgTokensPerRequest": 12.35 }clusterId←ClusterFactory.getClusterId()serverId←ConfigUtils.getServerId()Trywith"unknown"fallback so a transient lookup failure doesn't drop the snapshot.timestampisInstant.now().toString()(ISO-8601, UTC).REQUEST_COST_DENOMINATOR(consistent with what gets logged).Config keys
REQUEST_COST_PUSH_URLREQUEST_COST_PUSH_TOKENREQUEST_COST_PUSH_TIMEOUT_MS5000Existing knobs (unchanged):
REQUEST_COST_ACCOUNTING_ENABLED— overall request-cost gatingREQUEST_COST_TIME_WINDOW_SECONDS— scheduler cadence (drives publish cadence too)REQUEST_COST_DENOMINATOR— scaling factorBehavior
Failure handling
Logger.warnEvery(..., 10 min)and the snapshot is dropped. This is observational telemetry, not durable accounting — buffering on disk would be over-engineered.publish()returns immediately; the actual POST happens on the shared submitter.Test plan
RequestCostSnapshotTest(3 tests) — locks JSON shape: every expected field present, all values round-trip correctly, exactly 10 fields on the wire (guards against accidental field leakage when someone adds a private helper later).RequestCostPublisherTest(9 tests) — enable gate: disabled by default, URL-alone is not enough, token-alone is not enough, both-set activates,publish()is a no-op when disabled../mvnw compile -pl :dotcms-core -DskipTests— BUILD SUCCESS../mvnw test -pl :dotcms-core -Dtest='RequestCostSnapshotTest,RequestCostPublisherTest'— 12/12 passing.REQUEST_COST_PUSH_URLat a local request bin, verify a snapshot lands within the window.Out of scope
CircuitBreakerUrlblocks private subnets via a staticLazy<Boolean>that capturesALLOW_ACCESS_TO_PRIVATE_SUBNETSon first access — that makes the test fragile to JVM/test ordering. The post path is mostly orchestration; serialization is fully covered by the snapshot test.Pricebreakdown in the payload (e.g.{DB_QUERY: N, ES_QUERY: M}). Would require additional counters inincrementCost; can land as a follow-up if the collector needs it.RequestCost; the new field names use "tokens" since that's the better term, but a package rename is a separate, larger refactor (24+ callers).Files
dotCMS/src/main/java/com/dotcms/cost/RequestCostSnapshot.java(new)dotCMS/src/main/java/com/dotcms/cost/RequestCostPublisher.java(new)dotCMS/src/main/java/com/dotcms/cost/RequestCostApiImpl.java(modified — adds snapshot construction + publisher call inlogRequestCost)dotCMS/src/test/java/com/dotcms/cost/RequestCostSnapshotTest.java(new)dotCMS/src/test/java/com/dotcms/cost/RequestCostPublisherTest.java(new)🤖 Generated with Claude Code