feat(data-pipeline)!: export client-computed span stats as OTLP trace metrics#2150

Open
mabdinur wants to merge 11 commits into main from munir/otlp-trace-metrics

@mabdinur

What does this PR do?

Adds a new OTLP trace-metrics export path to the data pipeline. When the
SDK computes stats client-side, the span concentrator now flushes them as
traces.span.sdk.metrics.duration OTLP histograms in addition to the
existing agent /v0.6/stats payload.

Includes a fix so that the gRPC method name (grpc.method.name, falling
back to rpc.method) is part of the aggregation key rather than attached
after the fact; spans with different gRPC methods now get separate
metric data points.

Motivation

The OTLP metrics path enables downstream consumers (e.g., OTel
spanmetrics-connector-compatible backends) to receive exact per-method
latency histograms without going through the Datadog agent.

Additional Notes

Breaking changes (this goes out in a major version):

  • FixedAggregationKey<T> gains a grpc_method: T field. Any code that
    constructs this struct with a full struct literal (rather than
    struct-update syntax such as ..Default::default()) must add the new
    field. The SHM concentrator's SHM_VERSION is bumped from 1 → 2 to
    prevent layout mismatches between old workers and a new sidecar;
    mismatched pairs fail with a version-mismatch error rather than
    silently misinterpreting memory.

  • StatsBucket::insert no longer takes a grpc_method parameter.

The agent /v0.6/stats protobuf wire format (ClientGroupedStats) is
unchanged — there is no grpc_method field in the proto, so the new
key dimension is surfaced only through the OTLP path.
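
As a minimal sketch of what the first breaking change means for callers, the block below uses a trimmed-down, hypothetical stand-in for `FixedAggregationKey<T>` (the `service_name` and `resource_name` fields are assumptions for illustration; the real struct has more fields). It contrasts a full struct literal, which has to name `grpc_method`, with struct-update syntax, which keeps compiling when a field is added.

```rust
// Hypothetical, trimmed-down stand-in for FixedAggregationKey<T>.
// Only `grpc_method` is taken from the PR; the other fields are illustrative.
#[derive(Debug, Default, Clone, PartialEq, Eq, Hash)]
pub struct FixedAggregationKey<T> {
    pub service_name: T,
    pub resource_name: T,
    pub grpc_method: T, // the field added by this PR
}

/// Full struct literal: every field, including `grpc_method`, must be named.
pub fn key_by_name() -> FixedAggregationKey<&'static str> {
    FixedAggregationKey {
        service_name: "checkout",
        resource_name: "POST /pay",
        grpc_method: "",
    }
}

/// Struct-update syntax: unnamed fields come from `Default`, so adding a
/// field to the struct does not break this call site.
pub fn key_with_default() -> FixedAggregationKey<&'static str> {
    FixedAggregationKey {
        service_name: "checkout",
        resource_name: "POST /pay",
        ..Default::default()
    }
}
```

Both functions build the same key here, because the default for `&str` is the empty string.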

How to test the change?

Unit tests in libdd-trace-stats cover aggregation key extraction
(including grpc.method.name and rpc.method) and the OTLP exact-cell
flush path. Integration tests in libdd-data-pipeline exercise the
full exporter pipeline.

cargo nextest run -p libdd-trace-stats -p datadog-ipc -p libdd-data-pipeline

mabdinur and others added 11 commits June 15, 2026 16:46
Foundation pieces consumed by the OTLP trace-metrics exporter that follows.
These are pure additions with no breaking changes.

- libdd-ddsketch: `DDSketch::from_pb` rebuilds a sketch from its protobuf
  representation (or `None` when the mapping is missing/invalid); a thin
  `DDSketch::from_encoded` wraps protobuf decoding + `from_pb`. Lets callers
  read back the ok/error sketches that the span concentrator publishes.
  Includes a roundtrip test that goes `encode_to_vec` -> `from_encoded` and
  asserts bin count + total weight survive the trip.

- libdd-trace-utils: extend `OtlpResourceInfo` with two new fields:
  `hostname` (emitted as the `host.name` resource attribute when set) and
  `process_tags` (comma-separated `key:value` pairs, each becoming a
  `dd.<key>` resource attribute). The struct is `#[non_exhaustive]`, so
  adding fields is forward-compatible.
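
The `process_tags` mapping described above can be sketched roughly as follows. The function name `process_tags_to_attributes` is hypothetical, and skipping entries that have no `:` separator is an assumption rather than documented behavior of `OtlpResourceInfo`.

```rust
/// Hypothetical helper: turn "k1:v1,k2:v2" into ("dd.k1", "v1") pairs.
/// Entries without a ':' or with an empty key are skipped (an assumption).
pub fn process_tags_to_attributes(process_tags: &str) -> Vec<(String, String)> {
    process_tags
        .split(',')
        // Split on the first ':' only, so values may themselves contain ':'.
        .filter_map(|pair| pair.split_once(':'))
        .map(|(key, value)| (key.trim(), value.trim()))
        .filter(|(key, _)| !key.is_empty())
        .map(|(key, value)| (format!("dd.{key}"), value.to_string()))
        .collect()
}
```

For example, `process_tags_to_attributes("entrypoint.name:app,svc.auto:true")` yields the attributes `dd.entrypoint.name` and `dd.svc.auto`.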

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Make the span concentrator accumulate exact per-cell (ok/error) duration
totals and min/max in nanoseconds alongside the existing combined `duration`
that the /v0.6/stats agent payload uses, and publish them on a new sidecar
that the OTLP trace-metrics path can read.

- `GroupedStats` gains six pub(super) accumulators (`ok_duration`/`ok_min`/
  `ok_max` + the error trio) updated inside `insert`. They are seeded on the
  first span in each cell (count == 1) so the natural `0` default cannot
  masquerade as a real minimum.

- New public types `OtlpExactCell`, `OtlpExactGroup`, `OtlpStatsBucket` carry
  the exact scalars alongside an unmodified `pb::ClientStatsBucket`. The
  `grpc_method` field on `OtlpExactGroup` is intentionally introduced here
  but only ever populated with `String::new()`; a later commit fills it in.

- `StatsBucket::flush` now delegates to a new `flush_with_otlp_exact` which
  produces both the protobuf bucket (identical bytes) and the parallel
  sidecar. `SpanConcentrator::flush` and `flush_with_otlp_exact` share a
  generic `drain_due_buckets` helper so the bucket-window/buffer-len logic
  stays in one place.

- A new concentrator test drives the full path through `add_span` for 3 ok
  + 2 error spans and asserts each cell's count/duration/min/max plus
  `ok_duration + error_duration == group.duration` (the agent field).
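
The accumulator behavior described in this commit can be illustrated with a small sketch. The `Cell` and `Group` types below are hypothetical simplifications (the commit describes six separate fields on `GroupedStats`); the point is the seeding of min/max on the first span and the invariant that the two cells sum to the combined duration.

```rust
/// Hypothetical per-cell accumulator (one for ok spans, one for error spans).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Cell {
    pub count: u64,
    pub duration: u64, // total, in nanoseconds
    pub min: u64,
    pub max: u64,
}

impl Cell {
    pub fn record(&mut self, duration_ns: u64) {
        self.count += 1;
        self.duration += duration_ns;
        if self.count == 1 {
            // Seed on the first span so the default 0 is never mistaken
            // for a real minimum.
            self.min = duration_ns;
            self.max = duration_ns;
        } else {
            self.min = self.min.min(duration_ns);
            self.max = self.max.max(duration_ns);
        }
    }
}

/// Hypothetical group: combined duration plus the two exact cells.
#[derive(Debug, Default)]
pub struct Group {
    pub duration: u64,
    pub ok: Cell,
    pub error: Cell,
}

impl Group {
    pub fn insert(&mut self, duration_ns: u64, is_error: bool) {
        self.duration += duration_ns;
        if is_error {
            self.error.record(duration_ns);
        } else {
            self.ok.record(duration_ns);
        }
    }
}

/// Three ok spans and two error spans, mirroring the shape of the test above.
pub fn sample_group() -> Group {
    let mut group = Group::default();
    for d in [10, 30, 20] {
        group.insert(d, false);
    }
    for d in [5, 50] {
        group.insert(d, true);
    }
    group
}
```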

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nfig

Prepare the data-pipeline OTLP layer to host a second exporter (trace
metrics) without changing the existing trace path's behavior or public API.

- `otlp/exporter.rs`: factor the actual POST + retry plumbing into a new
  crate-private `send_otlp_http(endpoint_url, headers, timeout, ...)` helper.
  `send_otlp_traces_http` becomes a thin wrapper that pulls fields out of
  `OtlpTraceConfig` and calls it; the existing public function signature is
  unchanged, so external callers see no diff. Two new pub(crate) constants
  (`OTLP_MAX_ATTEMPTS`, `OTLP_SHUTDOWN_MAX_ATTEMPTS`) replace the previous
  `OTLP_MAX_RETRIES` literal so the trace-metrics worker can use a single
  attempt on shutdown.

- `otlp/config.rs`: add `OtlpMetricsConfig` mirroring `OtlpTraceConfig`
  plus an `otel_trace_semantics_enabled` flag for `DD_TRACE_OTEL_SEMANTICS_ENABLED`.
  Annotated `#[allow(dead_code)]` until a follow-up commit consumes it.

- `trace_exporter/builder.rs`: factor the inline OTLP header-map builder
  out of `build_async` into a small `build_otlp_header_map` helper and
  refactor the existing OTLP traces config building to use it. No behavior
  change; this dedup makes the metrics-config branch trivial when it lands.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…metrics

Wire up the actual OTLP trace-metrics exporter on top of the foundation
pieces from earlier commits.

- New `libdd-data-pipeline/src/otlp/metrics.rs`:
  - `map_stats_to_otlp_metrics` builds an `ExportMetricsServiceRequest`
    JSON value from `&[OtlpStatsBucket]` (one histogram data point per
    aggregation-key (ok|error) cell). `count`/`sum`/`min`/`max` come from
    the sidecar's exact accumulators (ns -> s); `bucketCounts` is projected
    from the per-cell DDSketch onto a fixed 17-bucket spanmetrics-style
    layout. Empty cells are suppressed.
  - `OtlpStatsExporter<C>` runs as a `libdd_shared_runtime::Worker`:
    `trigger` waits one flush interval, `run` flushes + sends with
    `OTLP_MAX_ATTEMPTS`, `shutdown` force-flushes with
    `OTLP_SHUTDOWN_MAX_ATTEMPTS` (single attempt) so the final bucket is
    delivered inside the bounded shutdown window.
  - The mapper consumes `exact.grpc_method` (always empty here) so the
    later breaking-change commit only has to fill it in.

- `otlp/mod.rs`: declare the new `metrics` module, re-export
  `OtlpMetricsConfig` and `OtlpStatsExporter`, and extend the module-level
  doc to describe the trace-metrics path.

- `trace_exporter/builder.rs`: add `otlp_metrics_endpoint`,
  `otlp_metrics_headers` and `otel_trace_semantics_enabled` fields with
  matching setters (`set_otlp_metrics_endpoint`, `set_otlp_metrics_headers`,
  `enable_otel_trace_semantics`). When both an OTLP metrics endpoint and a
  stats bucket size are configured, spawn an `OtlpStatsExporter` worker on
  the shared runtime against an unconditionally-started
  `SpanConcentrator`; set a new `otlp_stats_enabled` flag on `TraceExporter`
  so the agent-info gate cannot later disable stats. The agent /v0.6/stats
  payload bytes are unchanged when no OTLP metrics endpoint is set.

- `trace_exporter/mod.rs`: add the `otlp_stats_enabled` field on
  `TraceExporter`.
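
To make the histogram mapping more concrete, here is a rough sketch of projecting durations onto explicit bucket bounds with the ns -> s conversion. It is not the mapper from this commit: it works on raw nanosecond samples instead of a DDSketch, the function names are hypothetical, and the bounds passed in are up to the caller rather than the 17-bucket layout mentioned above. It follows the OTLP convention that bucket `i` covers `(bounds[i-1], bounds[i]]`, with a final overflow bucket.

```rust
/// Convert nanoseconds to seconds, the unit used for the histogram values.
pub fn ns_to_s(ns: u64) -> f64 {
    ns as f64 / 1e9
}

/// Hypothetical projection of samples onto explicit, sorted bucket bounds
/// (in seconds). Returns `bounds_s.len() + 1` counts; the last entry counts
/// values above the final bound.
pub fn bucket_counts(bounds_s: &[f64], samples_ns: &[u64]) -> Vec<u64> {
    let mut counts = vec![0u64; bounds_s.len() + 1];
    for &ns in samples_ns {
        let value = ns_to_s(ns);
        // Number of bounds strictly below the value; upper bounds are
        // inclusive, so a value equal to a bound lands in that bound's bucket.
        let index = bounds_s.partition_point(|&bound| bound < value);
        counts[index] += 1;
    }
    counts
}
```

With bounds `[0.002, 0.01, 0.1]` and samples of 1 ms, 5 ms, 7 ms, 50 ms and 2 s, the result has four buckets: one below 2 ms, two in the 2 to 10 ms range, one in the 10 to 100 ms range, and one in the overflow bucket.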

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add the gRPC method name to the aggregation key so spans sharing the same
service/resource/etc. but different `grpc.method.name` aggregate into
distinct groups, and surface the value via the OTLP trace-metrics sidecar
introduced earlier on this branch.

- `aggregation.rs`:
  - New `GRPC_METHOD_FIELD` lookup list (`grpc.method.name`, fallback
    `rpc.method`) consumed by a new `get_grpc_method` helper.
  - New `FixedAggregationKey<T>.grpc_method` field, appended at the END of
    the struct so the `PartialOrd` derive's field order (and therefore the
    ordering of any existing comparisons) is unaffected for the pre-existing
    fields.
  - `BorrowedAggregationKey::from_obfuscated_span` now picks up
    `grpc_method`; `OwnedAggregationKey::From<pb::ClientGroupedStats>` sets
    it to `""` (the agent stats protobuf does not carry it).
  - `StatsBucket::flush_with_otlp_exact` does `std::mem::take` on the key's
    `grpc_method` and moves it into `OtlpExactGroup.grpc_method` before
    encoding the agent payload, so the OTLP path reads it from the sidecar
    while the /v0.6/stats wire format stays byte-for-byte unchanged.
  - Aggregation test gains a case asserting that `grpc.method.name` (and
    by fallthrough, `rpc.method`) are extracted into the key.

- `datadog-ipc/src/shm_stats.rs`: the SHM concentrator's
  `FixedAggregationKey` test fixture grows a `grpc_method: ""` field.
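
The lookup order described for `get_grpc_method` could look roughly like the sketch below. The signature, the `HashMap<String, String>` span-meta type, and the `meta_of` helper are assumptions for illustration; only the two attribute names and their priority come from the commit message.

```rust
use std::collections::HashMap;

/// Attribute names checked in priority order.
pub const GRPC_METHOD_FIELDS: [&str; 2] = ["grpc.method.name", "rpc.method"];

/// Hypothetical lookup: the first attribute present wins, otherwise "".
pub fn get_grpc_method(meta: &HashMap<String, String>) -> &str {
    GRPC_METHOD_FIELDS
        .iter()
        .find_map(|field| meta.get(*field))
        .map(String::as_str)
        .unwrap_or("")
}

/// Small helper to build span meta from string pairs (illustrative only).
pub fn meta_of(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

A span carrying both attributes resolves to the `grpc.method.name` value; a span with only `rpc.method` falls through to that one.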

BREAKING CHANGE: `FixedAggregationKey<T>` (re-exported from
`libdd_trace_stats::span_concentrator`) gains a public `grpc_method: T`
field. External callers that construct it via a struct literal must add
the field; callers using `Default::default()` are unaffected. The /v0.6/stats
agent protobuf wire format and behavior are unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n SDK computes stats

When otlp_stats_enabled, add _dd.stats_computed="true" to the OTLP ResourceSpans resource
attributes and Datadog-Client-Computed-Stats: yes to the HTTP request headers. The Agent's
OTLP receiver already checks both signals (otlp.go:372, otlp.go:272) and skips its
concentrator when either is set, preventing double-counted APM metrics.

The resource attribute survives Collector hops (unlike HTTP headers); the header covers direct
SDK→Agent connections. Both are backwards compatible: older Agents and non-Datadog OTLP
receivers silently ignore unknown resource attributes and headers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… negotiation with OTLP stats

grpc_method was part of FixedAggregationKey (Hash+PartialEq), splitting same-service gRPC spans
into separate buckets that encode_grouped_stats then serialised with an empty method — producing
duplicate indistinguishable ClientGroupedStats rows on the /v0.6/stats path. Move it to
GroupedStats (value side), set on group creation, and surface it to OtlpExactGroup from there.
This also removes the one breaking change introduced by the prior commit.

check_agent_info returned before refresh_v1_active when otlp_stats_enabled, preventing V1
protocol negotiation for exporters that combine enable_v1_protocol with OTLP metrics. Move the
early return to after the V1 refresh so only stats enable/disable is skipped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Rename OTLP_MAX_ATTEMPTS/OTLP_SHUTDOWN_MAX_ATTEMPTS to OTLP_MAX_RETRIES/
  OTLP_SHUTDOWN_MAX_RETRIES and rename the max_attempts parameter to
  max_retries throughout, converging on the retries convention used elsewhere
- Add TraceExporterBuilder::set_runtime_id so callers can supply the language
  tracer's existing runtime_id; falls back to a generated UUID when not set,
  ensuring OTLP trace exports and OTLP trace-metrics share the same runtime_id
  for backend correlation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spans with different gRPC methods were previously merged into the same
stats group (only the first span's method was kept). Adding grpc_method
to FixedAggregationKey ensures each method gets a separate bucket.

The OtlpExactGroup.grpc_method field is now sourced from the key rather
than a GroupedStats sidecar. The agent /v0.6/stats protobuf wire format
is unchanged (no grpc_method field in ClientGroupedStats).

SHM_VERSION bumped to 2 because FixedAggregationKey<StringRef> is
#[repr(C)] and the new field changes the layout; mismatched sidecar/worker
pairs will safely fail with a version-mismatch error.
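
A version gate of the kind described here might be sketched as follows. `ShmHeader`, `VersionMismatch`, and `check_version` are hypothetical names, and the real shared-memory header layout is not shown in this PR text; the sketch only illustrates failing fast on a mismatch instead of reading a `#[repr(C)]` layout that has changed.

```rust
/// Layout version expected by this build (bumped from 1 to 2 in this PR).
pub const SHM_VERSION: u32 = 2;

/// Hypothetical header at the start of the shared-memory region.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct ShmHeader {
    pub version: u32,
}

/// Hypothetical error returned when sidecar and worker disagree.
#[derive(Debug, PartialEq, Eq)]
pub struct VersionMismatch {
    pub expected: u32,
    pub found: u32,
}

/// Refuse to interpret the region unless the versions match.
pub fn check_version(header: &ShmHeader) -> Result<(), VersionMismatch> {
    if header.version == SHM_VERSION {
        Ok(())
    } else {
        Err(VersionMismatch {
            expected: SHM_VERSION,
            found: header.version,
        })
    }
}
```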

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mabdinur mabdinur requested review from a team as code owners June 22, 2026 17:08
@mabdinur mabdinur requested review from vpellan and removed request for a team June 22, 2026 17:08

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6f244c9e7d


Comment on lines +374 to +375
    if self.otlp_stats_enabled {
        return;


P2: Apply agent trace filters before skipping OTLP stats updates

When OTLP metrics are enabled and the agent /info response carries filter_tags, regex filters, or ignore_resources, this early return exits before installing the new TraceFilterer. process_traces_for_stats still runs with the OTLP concentrator enabled and uses self.trace_filterer.load(), so it keeps the empty filter config and exports metrics for traces that the agent config says should be rejected/ignored. Move this return below the trace-filter update (and state bookkeeping) and only skip the agent-driven stats enable/disable block.
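
The ordering this comment asks for can be sketched as below. The `Exporter` and `AgentInfo` types and their fields are hypothetical simplifications, not the actual `TraceExporter` code; the sketch only shows the filter update running before the early return, so that only the agent-driven stats toggle is skipped.

```rust
/// Hypothetical subset of the agent /info response.
#[derive(Debug, Default, Clone)]
pub struct AgentInfo {
    pub ignore_resources: Vec<String>,
    pub stats_supported: bool,
}

/// Hypothetical, simplified exporter state.
#[derive(Debug, Default)]
pub struct Exporter {
    pub otlp_stats_enabled: bool,
    pub ignore_resources: Vec<String>,
    pub agent_stats_enabled: bool,
}

impl Exporter {
    pub fn check_agent_info(&mut self, info: &AgentInfo) {
        // 1. Always install the latest filter configuration first.
        self.ignore_resources = info.ignore_resources.clone();

        // 2. With OTLP stats enabled, skip only the agent-driven toggle.
        if self.otlp_stats_enabled {
            return;
        }

        // 3. Otherwise let the agent decide whether stats are computed.
        self.agent_stats_enabled = info.stats_supported;
    }
}
```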


@dd-octo-sts

dd-octo-sts Bot commented Jun 22, 2026


Artifact Size Benchmark Report

aarch64-alpine-linux-musl

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /aarch64-alpine-linux-musl/lib/libdatadog_profiling.so | 7.76 MB | 7.76 MB | +0% (+48 B) 👌 |
| /aarch64-alpine-linux-musl/lib/libdatadog_profiling.a | 84.02 MB | 84.56 MB | +.64% (+554.47 KB) 🔍 |

aarch64-unknown-linux-gnu

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /aarch64-unknown-linux-gnu/lib/libdatadog_profiling.so | 10.36 MB | 10.44 MB | +.74% (+79.29 KB) 🔍 |
| /aarch64-unknown-linux-gnu/lib/libdatadog_profiling.a | 95.13 MB | 95.71 MB | +.61% (+597.64 KB) 🔍 |

libdatadog-x64-windows

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.dll | 24.93 MB | 25.11 MB | +.73% (+186.50 KB) 🔍 |
| /libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.lib | 87.33 KB | 87.33 KB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.pdb | 181.51 MB | 182.39 MB | +.48% (+904.00 KB) 🔍 |
| /libdatadog-x64-windows/debug/static/datadog_profiling_ffi.lib | 928.21 MB | 932.35 MB | +.44% (+4.13 MB) 🔍 |
| /libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.dll | 8.12 MB | 8.22 MB | +1.25% (+104.50 KB) ⚠️ |
| /libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.lib | 87.33 KB | 87.33 KB | 0% (0 B) 👌 |
| /libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.pdb | 24.03 MB | 24.20 MB | +.71% (+176.00 KB) 🔍 |
| /libdatadog-x64-windows/release/static/datadog_profiling_ffi.lib | 47.96 MB | 48.33 MB | +.76% (+376.87 KB) 🔍 |

libdatadog-x86-windows

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.dll | 21.62 MB | 21.78 MB | +.74% (+165.00 KB) 🔍 |
| /libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.lib | 88.71 KB | 88.71 KB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.pdb | 185.58 MB | 186.50 MB | +.49% (+944.00 KB) 🔍 |
| /libdatadog-x86-windows/debug/static/datadog_profiling_ffi.lib | 921.15 MB | 925.38 MB | +.45% (+4.22 MB) 🔍 |
| /libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.dll | 6.27 MB | 6.33 MB | +.98% (+63.50 KB) 🔍 |
| /libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.lib | 88.71 KB | 88.71 KB | 0% (0 B) 👌 |
| /libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.pdb | 25.76 MB | 25.96 MB | +.75% (+200.00 KB) 🔍 |
| /libdatadog-x86-windows/release/static/datadog_profiling_ffi.lib | 45.59 MB | 45.94 MB | +.75% (+354.41 KB) 🔍 |

x86_64-alpine-linux-musl

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /x86_64-alpine-linux-musl/lib/libdatadog_profiling.a | 74.91 MB | 75.38 MB | +.62% (+479.11 KB) 🔍 |
| /x86_64-alpine-linux-musl/lib/libdatadog_profiling.so | 8.61 MB | 8.68 MB | +.77% (+68.03 KB) 🔍 |

x86_64-unknown-linux-gnu

| Artifact | Baseline | Commit | Change |
|---|---|---|---|
| /x86_64-unknown-linux-gnu/lib/libdatadog_profiling.a | 90.33 MB | 90.84 MB | +.55% (+514.57 KB) 🔍 |
| /x86_64-unknown-linux-gnu/lib/libdatadog_profiling.so | 10.48 MB | 10.56 MB | +.69% (+74.78 KB) 🔍 |
