Tighten tracer.metrics defaults to protect tight-heap JVMs #11500
dougqh wants to merge 6 commits
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master results
Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.
Cut the implicit TRACER_METRICS_MAX_PENDING default from 2048 (logical) to 128 on normal heap and to 64 at Xmx < 128 MB, and the implicit TRACER_METRICS_MAX_AGGREGATES default from 2048 to 256 at tight heap.

Why
---
The metrics inbox is an MpscArrayQueue<SpanSnapshot> sized to maxPending * LEGACY_BATCH_SIZE (64). With one ~120 B SpanSnapshot per slot, the prior 131072-slot default pinned ~15 MB worst-case in-flight when the aggregator stalled.

At Xmx <= ~128 MB the G1 survivor region is too small to absorb that footprint -- observed catastrophically at Xmx 64 MB on spring-petclinic where the inbox overflowed young gen and triggered To-space Exhausted → Full GC storms (0 r/s in the worst case).

New defaults bound the worst-case in-flight footprint at ~1 MB on normal heap and ~500 KB at tight heap, comfortably below typical survivor sizes and large enough to absorb the sub-second consumer stalls we actually see in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default).

Customers who explicitly configure TRACER_METRICS_MAX_PENDING are unaffected; the LEGACY_BATCH_SIZE multiplier still applies to overrides. Only the implicit defaults shrink.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
391704f to a9acd8d
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a9acd8d1d5
final int defaultMaxAggregates = tightHeap ? 256 : 2048;
final int defaultMaxPending = tightHeap ? 64 : 128;
Sync metadata with the new tracer-metrics defaults
Changing the implicit default here leaves metadata/supported-configurations.json advertising DD_TRACE_TRACER_METRICS_MAX_PENDING and DD_TRACE_TRACER_METRICS_MAX_AGGREGATES as 2048. That file is the source used for supported-configuration metadata/docs, so users and config-inversion tooling will still see the old defaults even though normal heaps now get 128 pending and tight heaps get 64/256. Please update the metadata entry (or otherwise represent the heap-dependent default) along with this runtime change.
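For readers following the thread, the heap-dependent selection under review can be sketched roughly as below. This is a minimal illustration, not the tracer's code: the class and method names are hypothetical, and it assumes the tight-heap flag is derived from the JVM's maximum heap size via Runtime.maxMemory(), which this thread does not confirm. The 128 MB threshold and the 64/128 and 256/2048 values are the ones quoted in the PR.

```java
import java.util.OptionalInt;

/** Hypothetical sketch of heap-dependent implicit defaults; names are illustrative only. */
final class MetricsDefaults {
  // "Xmx < 128 MB" is the tight-heap boundary quoted in the PR description.
  static final long TIGHT_HEAP_THRESHOLD_BYTES = 128L * 1024 * 1024;

  // Assumption: tightness is judged from the maximum heap size alone.
  static boolean isTightHeap(long maxHeapBytes) {
    return maxHeapBytes < TIGHT_HEAP_THRESHOLD_BYTES;
  }

  // An explicitly configured value always wins; only the implicit default varies.
  static int maxPending(OptionalInt explicit, long maxHeapBytes) {
    if (explicit.isPresent()) {
      return explicit.getAsInt();
    }
    return isTightHeap(maxHeapBytes) ? 64 : 128;
  }

  static int maxAggregates(OptionalInt explicit, long maxHeapBytes) {
    if (explicit.isPresent()) {
      return explicit.getAsInt();
    }
    return isTightHeap(maxHeapBytes) ? 256 : 2048;
  }

  public static void main(String[] args) {
    // Runtime.maxMemory() reports the heap limit the JVM will attempt to use.
    long maxHeap = Runtime.getRuntime().maxMemory();
    System.out.println("tightHeap=" + isTightHeap(maxHeap));
    System.out.println("maxPending=" + maxPending(OptionalInt.empty(), maxHeap));
    System.out.println("maxAggregates=" + maxAggregates(OptionalInt.empty(), maxHeap));
  }
}
```

Passing the explicit setting as an OptionalInt keeps the "customers who configured a value are unaffected" rule visible in one place.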
From Claude...
Thanks, pushed 9ab5e58e4e updating DD_TRACE_TRACER_METRICS_MAX_PENDING from 2048 → 128 to match the new normal-heap default.

DD_TRACE_TRACER_METRICS_MAX_AGGREGATES is left at 2048: that's still the normal-heap default (only the tight-heap branch changes it to 256). The current schema is {version, type, default, aliases} with a single string default, so there's no way to encode a "heap-dependent default" without a schema extension (e.g. a defaultExpression field or a defaults: [{when, value}] shape). So both keys' tight-heap branches stay unrepresented in metadata; the typical customer (Xmx ≥ 128 MB) sees the documented value.

If a schema extension is in scope, happy to take that as a follow-up.
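To make the suggested defaults: [{when, value}] shape concrete, here is one way a consumer of such metadata might resolve it. Everything in this sketch is hypothetical: the schema extension does not exist today, and the Rule type, the resolve method, and the choice to express "when" as a predicate on maximum heap bytes are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.LongPredicate;

/** Hypothetical resolver for a {when, value} list of conditional defaults. */
final class ConditionalDefaults {
  static final long MB = 1024L * 1024;

  /** One {when, value} entry; "when" is modelled as a predicate on max heap bytes. */
  static final class Rule {
    final LongPredicate when;
    final String value;

    Rule(LongPredicate when, String value) {
      this.when = when;
      this.value = value;
    }
  }

  // First matching rule wins; otherwise fall back to the existing single "default" string.
  static String resolve(List<Rule> rules, String fallback, long maxHeapBytes) {
    for (Rule rule : rules) {
      if (rule.when.test(maxHeapBytes)) {
        return rule.value;
      }
    }
    return fallback;
  }

  // Illustrative encoding of the max-pending key: 64 below 128 MB, else the fallback.
  static List<Rule> maxPendingRules() {
    return Arrays.asList(new Rule(heap -> heap < 128 * MB, "64"));
  }

  public static void main(String[] args) {
    long maxHeap = Runtime.getRuntime().maxMemory();
    System.out.println("max pending default: " + resolve(maxPendingRules(), "128", maxHeap));
  }
}
```

Keeping the single default as the fallback would let existing tooling that only reads that field keep working unchanged.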
The default cut from 2048 → 128 needs the matching entry in metadata/supported-configurations.json so config-inversion tooling and supported-configuration docs reflect the new value. DD_TRACE_TRACER_METRICS_MAX_AGGREGATES is left at 2048: the normal-heap default is unchanged. The metadata schema doesn't support heap-dependent defaults, so the tight-heap branch (64 / 256) isn't representable; the metadata reflects the normal-heap default that applies to the typical customer (Xmx >= 128 MB). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default changed from 2048 to 128; version field must be incremented when the default value changes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/merge
The merge request has been interrupted because the build 6485393595883034246 took longer than expected. The current limit for the base branch 'master' is 120 minutes.
What Does This Do
Cut the implicit tracer.metrics.max.pending default from 2048 (logical) to 128 on normal heap and to 64 at Xmx < 128 MB, and the implicit tracer.metrics.max.aggregates default from 2048 to 256 at tight heap. Customers who explicitly configured either property keep their value.

Motivation
The metrics inbox is an MpscArrayQueue<SpanSnapshot> sized to maxPending * LEGACY_BATCH_SIZE (64). With one ~120 B SpanSnapshot per slot, the prior 131,072-slot default pinned ~15 MB worst-case in-flight when the aggregator stalled.

At Xmx ≤ ~128 MB the G1 survivor region is too small to absorb that footprint. This was observed catastrophically at Xmx 64m on spring-petclinic: SpanSnapshots overflow young gen and trigger To-space Exhausted → Full GC storms (0 r/s in the worst case).

The JFR allocation profile at Xmx 64m attributes this to SpanSnapshot being the #2 datadog allocator (~280 MB sampled bytes over 90 s) since #11381 introduced the producer/consumer split. The inbox amplifies the per-publish allocation into a heap-pressure problem only at tight heap.

New defaults
- Normal heap: maxPending 128, maxAggregates 2048 (unchanged); worst-case in-flight footprint ~1 MB.
- Tight heap (Xmx < 128 MB): maxPending 64, maxAggregates 256; worst-case in-flight footprint ~500 KB.
Both are large enough to absorb the sub-second aggregator stalls we observe in practice (~0.8 s of buffer at 10 K spans/s on the normal-heap default).
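The sizing figures quoted in this description follow from simple arithmetic, sketched below as a small Java program. The constants are the approximate values stated in the PR text (64 for LEGACY_BATCH_SIZE, ~120 B per SpanSnapshot); the class and method names are hypothetical and exist only for this illustration.

```java
/** Back-of-the-envelope inbox sizing using the figures quoted in the PR text. */
final class InboxFootprint {
  static final int LEGACY_BATCH_SIZE = 64;
  // "~120 B" per SpanSnapshot is the PR's estimate, not a measured constant.
  static final int APPROX_SNAPSHOT_BYTES = 120;

  // Queue capacity in slots: maxPending * LEGACY_BATCH_SIZE.
  static long slots(int maxPending) {
    return (long) maxPending * LEGACY_BATCH_SIZE;
  }

  // Worst case: every slot holds a snapshot while the aggregator is stalled.
  static long worstCaseBytes(int maxPending) {
    return slots(maxPending) * APPROX_SNAPSHOT_BYTES;
  }

  // How long a full stall can last before the inbox fills at a given publish rate.
  static double bufferSeconds(int maxPending, double spansPerSecond) {
    return slots(maxPending) / spansPerSecond;
  }

  public static void main(String[] args) {
    for (int maxPending : new int[] {2048, 128, 64}) {
      System.out.printf(
          "maxPending=%d slots=%d worstCase=%.2f MB buffer at 10K spans/s=%.2f s%n",
          maxPending,
          slots(maxPending),
          worstCaseBytes(maxPending) / 1e6,
          bufferSeconds(maxPending, 10_000));
    }
  }
}
```

With these inputs the old default works out to 131,072 slots and about 15.7 MB, while the new normal-heap default gives 8,192 slots, about 0.98 MB, and roughly 0.82 s of buffer at 10 K spans/s, in line with the rounded figures above.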
What this is not
- MpscArrayQueue<SpanSnapshot>.
- SpanSnapshot per metrics-eligible span.
- onStatsInboxFull.

It's purely a bound on the inbox's worst-case footprint, sized for the survivor-region constraint that #11381's per-span allocation pattern made load-bearing.
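As a rough picture of what "a bound on the inbox's worst-case footprint" means, the sketch below shows a fixed-capacity inbox whose publish path never blocks and never grows. It is not the tracer's implementation: it substitutes the JDK's ArrayBlockingQueue for the JCTools MpscArrayQueue, the BoundedInbox class is hypothetical, and the drop counter merely stands in for whatever onStatsInboxFull does, which this PR does not describe.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical fixed-capacity inbox: publishes are rejected, not queued, once full. */
final class BoundedInbox<T> {
  private final ArrayBlockingQueue<T> queue;
  private final AtomicLong dropped = new AtomicLong();

  BoundedInbox(int capacity) {
    // Capacity is fixed at construction, so in-flight items can never exceed it.
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Non-blocking publish; returns false and counts a drop when the inbox is full. */
  boolean publish(T item) {
    if (queue.offer(item)) {
      return true;
    }
    dropped.incrementAndGet(); // stand-in for an inbox-full hook
    return false;
  }

  /** Consumer side: returns the next item, or null if the inbox is empty. */
  T poll() {
    return queue.poll();
  }

  long dropped() {
    return dropped.get();
  }

  int size() {
    return queue.size();
  }

  public static void main(String[] args) {
    BoundedInbox<Integer> inbox = new BoundedInbox<>(4);
    int accepted = 0;
    for (int i = 0; i < 6; i++) {
      if (inbox.publish(i)) {
        accepted++;
      }
    }
    // With capacity 4 and no consumer running: 4 accepted, 2 dropped.
    System.out.println("accepted=" + accepted + " dropped=" + inbox.dropped());
  }
}
```

Shrinking the capacity trades a shorter tolerated consumer stall for a smaller pinned footprint, which is the trade-off the new defaults make.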
Validation
Petclinic load test (Java 17, 8 jmeter threads, GET /owners/3), cooled bench with 60s between runs, n=2 per heap.

vs 1.62 baseline
† At Xmx ≥ 128 MB the maxAggregates default is unchanged and the maxPending change only reduces worst-case in-flight memory (rarely binding at large heap). 192m / 256m measurements above are from same-day runs of the merged-master baseline (#11382 tip) and apply unchanged to #11500.

vs merged master baseline (the gain this PR contributes)
The +3.1% at 64m is the targeted improvement: the smaller inbox keeps the SpanSnapshot footprint inside G1 survivor regions at tight heap. At 96m the original default never bound, so the PR is a wash there.

Variance behavior at tight heap
During the same thermal window, 1.62 hit a To-space Exhausted variance bomb at 96m (one iter collapsed to 2,496 r/s):
#11500 sustained throughput. This is the catastrophic failure mode the PR is designed to soften — without it, the bomb is observable at production Xmx ≤ ~128 MB.
Test plan
:internal-api:test - ConfigTest* passes
:dd-trace-core:test - datadog.trace.common.metrics.* passes (92/92 locally)

🤖 Generated with Claude Code