Skip to content

feat(telemetry): export per-status serve_shape request counter#4500

Merged
alco merged 6 commits into
mainfrom
alco/request-telemetry
Jun 9, 2026
Merged

feat(telemetry): export per-status serve_shape request counter#4500
alco merged 6 commits into
mainfrom
alco/request-telemetry

Conversation

@erik-the-implementer

@erik-the-implementer erik-the-implementer commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

What

Adds a new unsampled, per-request metric from the shape-serving path:

electric.plug.serve_shape.requests.count   tags: [status, known_error, live]

It hangs off the existing [:electric, :plug, :serve_shape] telemetry event (no new emission point) and threads a known_error flag into that event's metadata, read straight off the electric-internal-known-error response header.

Why

We sample shape-request spans aggressively (head sampling + tail overrides) to stay within our tracing event budget. That makes the span dataset great for drill-down but unreliable as a health signal:

  • It under-represents true volume (sampled), and
  • It drops some request classes entirely — admission-control rejections are marked known_error and are intentionally excluded from span export, so a request plot built from spans can look perfectly healthy while the system is actually shedding load under overload.

Admission rejection counts already exist as a metric (electric.admission_control.reject.count), but there was no general, status-dimensioned request counter. The existing serve_shape metrics also (a) drop live/long-poll requests and (b) aren't tagged by response status, so they can't express request mix or error rate.

This counter fills that gap:

  • Unsampled — every request is counted, so there's no detection floor. Errors and rejections are visible the moment they rise, not once they cross a sampling threshold.
  • Counts live requests too — unlike the other serve_shape.* metrics, which keep: live != true. Cheap to do as an aggregated metric (the reason we couldn't do it for spans doesn't apply).
  • Admission rejections appear inline as status=503, known_error=true. check_admission halts the conn, but a halt isn't a raise, so the halted conn still flows through emit_shape_telemetry/1 and gets counted.
  • known_error matches the wire signal — it's derived from the electric-internal-known-error response header, the same byte downstream consumers (e.g. the edge worker's tracing) key on, so the classification is consistent end to end.

Intended use: this becomes the authoritative "requests by status / error rate" dashboard panel and alert source, leaving the sampled span dataset for exemplar drill-down.

Changes

  • electric-telemetry: define counter("electric.plug.serve_shape.requests.count", tags: [:status, :known_error, :live]) against the existing event (explicit event_name/measurement to avoid colliding with the existing serve_shape.count).
  • sync-service: add known_error to the serve_shape event metadata, derived from the electric-internal-known-error response header. The header check lives next to the code that sets it so the expected values stay single-sourced.

Cardinality

Bounded: status (~6 codes) × known_error (2) × live (2), per stack.

Notes

  • One pre-existing gap is unchanged: the mid-stream re-raise path (Plug.Conn.chunk/2 raising after the response is committed) doesn't emit the serve_shape event, so those requests aren't counted here — same limitation that already affects every serve_shape.* metric.
  • No rollout/rollback concerns: additive metric only, no behavior change on the request path.

@codecov

codecov Bot commented Jun 4, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.67%. Comparing base (7892079) to head (29a655d).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4500       +/-   ##
===========================================
+ Coverage   32.48%   56.67%   +24.19%     
===========================================
  Files         216      373      +157     
  Lines       18368    39654    +21286     
  Branches     6478    10977     +4499     
===========================================
+ Hits         5967    22475    +16508     
- Misses      12369    17108     +4739     
- Partials       32       71       +39     
Flag Coverage Δ
electric-telemetry 69.10% <ø> (?)
elixir 69.10% <ø> (?)
packages/agents 70.75% <ø> (?)
packages/agents-mcp 77.54% <ø> (?)
packages/agents-mobile 66.92% <ø> (ø)
packages/agents-runtime 79.99% <ø> (?)
packages/agents-server 74.16% <ø> (+0.26%) ⬆️
packages/agents-server-ui 6.21% <ø> (ø)
packages/electric-ax 46.42% <ø> (?)
packages/experimental 87.73% <ø> (?)
packages/react-hooks 86.48% <ø> (?)
packages/start 82.83% <ø> (?)
packages/typescript-client 91.83% <ø> (?)
packages/y-electric 56.05% <ø> (?)
typescript 56.49% <ø> (+24.00%) ⬆️
unit-tests 56.67% <ø> (+24.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

Adds an unsampled, per-status electric.plug.serve_shape.requests.count counter on the existing [:electric, :plug, :serve_shape] event, threading a known_error flag (read off the electric-internal-known-error response header) into the event metadata. Iteration 5 covers a single new commit (Trim the changelog entry) — the code, tests, and changeset frontmatter are byte-identical to iteration 4. Small, additive, low-risk. Ready to merge.

What's Working Well

  • Correctness verified end to end (unchanged). check_admission rejects with Api.Response.error(..., status: 503, known_error: true)Api.Response.send/2 writes electric-internal-known-error: truehalt(). A halt isn't a raise, so the halted conn still flows through super(opts) |> emit_shape_telemetry/1 and conn_has_known_error?(conn) reads ["true"], counting it as status=503, known_error=true (serve_shape_plug.ex:455, response.ex:396-398).
  • Single-sourced header values. conn_has_known_error?/1 sits directly below put_known_error_header/2 in response.ex:385-398, so reader and writer of the "true"/"false" byte stay co-located.
  • Test coverage locks in both paths. The two tests assert 200/known_error: false and admission-rejected 503/known_error: true, with the call_serve_shape_plug helper now reading max_concurrent_requests via Access.get/3 so the rejection test can inject %{initial: 0, existing: 0} without disturbing existing callers.
  • Changeset still valid. This iteration's only change trims the changelog body from a verbose multi-paragraph block to a single line; the frontmatter (@core/electric-telemetry, @core/sync-service, both patch) is unchanged and correct. The one-line summary is accurate and matches the tags: [:status, :known_error, :live] definition.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

  • conn_has_known_error?/1 still has no @spec (response.ex:396). Minor Elixir-convention point for a new public function; trivial one-liner, genuinely optional. Only carry-over item, non-blocking.

Issue Conformance

No linked issue — a minor warning per project convention; consider linking the dashboard/alerting work this metric enables. PR description is thorough and matches the code. The mid-stream re-raise gap (Plug.Conn.chunk/2 raising post-commit, so those requests aren't counted) remains honestly documented as a pre-existing limitation shared by all serve_shape.* metrics — not introduced here.

Note: codecov reports all modified lines covered and all tests passing on head 668626a; the changelog trim (29a655d) touches no executable code.

Previous Review Status

  • ✅ Comment wording in response.ex — final state correct ("true" or "false" (and absent when known_error is nil)).
  • ✅ Stale PR description — description matches conn_has_known_error?/1 co-location.
  • ✅ Missing test coverage for known_error metadata — resolved by the two new tests.
  • has_known_error_header?/1 exact-match brittleness — resolved by co-locating reader with writer.
  • ✅ Changelog entry trimmed to a single line this iteration — frontmatter and accuracy preserved.
  • ◻️ @spec on conn_has_known_error?/1 — optional, still open; non-blocking.

Nothing actionable remains. The PR is ready to merge.


Review iteration: 5 | 2026-06-08

@alco alco self-assigned this Jun 8, 2026
@alco alco force-pushed the alco/request-telemetry branch from d3bd19b to 668626a Compare June 8, 2026 13:47
@alco alco merged commit c025d2e into main Jun 9, 2026
72 checks passed
@alco alco deleted the alco/request-telemetry branch June 9, 2026 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants