Skip to content

feat(sync-service): record per-transaction fragment wall-time#4504

Merged
alco merged 12 commits into
mainfrom
alco/txn-wall-time-metric
Jun 9, 2026
Merged

feat(sync-service): record per-transaction fragment wall-time#4504
alco merged 12 commits into
mainfrom
alco/txn-wall-time-metric

Conversation

@erik-the-implementer

@erik-the-implementer erik-the-implementer commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a total_processing_time attribute to the pg_txn.replication_client.transaction_received span, set on the commit fragment. It records the wall-clock time taken to process all fragments of a single transaction — from when the begin was received to when the commit fragment finishes processing.

Today our spans only measure per-fragment processing time (~ms). They can't tell us how long a transaction's fragments are smeared across in wall-clock terms — which is the quantity that determines whether a shape consumer can idle past its suspend threshold mid-transaction (see #4501 / #4503).

Unlike receive_lag — which is anchored on the Postgres commit timestamp and measures end-to-end delivery lag, from when Postgres committed the transaction to when Electric finished processing it — total_processing_time is anchored entirely within Electric: it spans receipt of the begin fragment to completion of the commit fragment.

🤖 Generated with Claude Code

@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

Iteration 9. One new commit since iteration 8 — 4cdafc6 ("Verify total_processing_time span attr export in otel-export.lux"). It's a test-only change to the otel-export.lux integration test that asserts the new total_processing_time attribute survives end-to-end through the OTLP exporter. No production code touched. The PR remains ready to merge.

What's Working Well

  • The export verification is the right gap to close. Prior iterations proved the attribute is set (unit tests on shape_log_collector_test.exs); this commit proves it's actually exported through the real OTLP pipeline — a dimension the unit tests can't reach.
  • The sampling rationale is correct and necessary. The pg_txn.replication_client.transaction_received span is the only span gated by the custom sampler (telemetry/sampler.ex:17sample?/0), which defaults to otel_sampling_ratio: 0.01 (config.ex:85). Without forcing ELECTRIC_OTEL_SAMPLING_RATIO=1.0, the single test transaction (the insert into items at otel-export.lux:51) would be sampled ~1% of the time and the assertion would flake almost always. Forcing 1.0 makes it deterministic. Verified the env override only adds spans — the other span/metric assertions in the test are ungated (included?(_) -> true), so raising the sampling ratio can't disturb them.
  • Regex tolerates the millisecond-truncation case. total_processing_time=[0-9]+ matches 0, which is what a single-fragment txn will almost certainly produce (begin→commit-processed in well under 1 ms truncates to 0 per shape_log_collector.ex:498-499). So the assertion is robust even though the value it captures is likely 0. Matches the format convention of the existing sibling assertions (Span … .*attr=…).

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

Carry-overs from prior iterations, all non-blocking and untouched here:

  1. No unit suffix on total_processing_time (shape_log_collector.ex:498). Sibling timing attrs use …_µs; a _ms suffix would make the unit legible in dashboards — and would make the new lux assertion self-documenting.
  2. Millisecond truncation (shape_log_collector.ex:499). Because the value truncates to 0 for small txns, the new export assertion proves presence but not a meaningful value — it would pass on total_processing_time=0. Recording microseconds would both preserve signal and let the test assert something stronger. Minor.
  3. Name vs semantics: total_processing_time includes inter-fragment idle gaps; total_wall_time reads truer.

Issue Conformance

Still no linked issue (#4501 / #4503 are motivating context only) — minor, per the standing convention. This commit matches the PR's stated intent (verify the metric is exported) with no scope creep. Changeset present (@core/sync-service, patch); test-only commit needs none of its own.

Previous Review Status

  • Critical (stranded fixture / red CI) — resolved in iterations 6–7.
  • ✅ Production correctness (telemetry attribute, span placement, _mono rename) — unchanged and previously confirmed sound.
  • New this iteration: end-to-end export of the attribute now has integration coverage.
  • ↩️ Suggestions Working with Postgres WAL format from Elixir #1Postgres replication source TCP server #3 (unit suffix, µs granularity, naming) remain open — all non-blocking.

Note: I couldn't run the lux suite here (no Docker/mix env in this sandbox). Conclusions are from static verification — confirmed the span is sampler-gated at 1% by default, that the test issues a committing transaction, and that the regex matches the emitted integer-ms attribute including the 0 case. Worth a glance at the integration-test job on the merge head to confirm green.

Monorepo / Cross-Package Notes

sync-service-internal only; no HTTP-contract or TypeScript-client impact.


Review iteration: 9 | 2026-06-08

@codecov

codecov Bot commented Jun 4, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.46%. Comparing base (7892079) to head (4cdafc6).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4504       +/-   ##
===========================================
+ Coverage   32.48%   56.46%   +23.97%     
===========================================
  Files         216      358      +142     
  Lines       18368    39081    +20713     
  Branches     6478    10978     +4500     
===========================================
+ Hits         5967    22066    +16099     
- Misses      12369    16944     +4575     
- Partials       32       71       +39     
Flag Coverage Δ
packages/agents 70.75% <ø> (?)
packages/agents-mcp 77.54% <ø> (?)
packages/agents-mobile 66.92% <ø> (ø)
packages/agents-runtime 79.99% <ø> (?)
packages/agents-server 73.98% <ø> (+0.07%) ⬆️
packages/agents-server-ui 6.21% <ø> (ø)
packages/electric-ax 46.42% <ø> (?)
packages/experimental 87.73% <ø> (?)
packages/react-hooks 86.48% <ø> (?)
packages/start 82.83% <ø> (?)
packages/typescript-client 91.83% <ø> (?)
packages/y-electric 56.05% <ø> (?)
typescript 56.46% <ø> (+23.97%) ⬆️
unit-tests 56.46% <ø> (+23.97%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@netlify

netlify Bot commented Jun 5, 2026

Copy link
Copy Markdown

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 6f54a60
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6a26cf26c94d4000081bf072
😎 Deploy Preview https://deploy-preview-4504--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@alco alco self-assigned this Jun 8, 2026
alco and others added 9 commits June 8, 2026 16:05
Add a pg_txn.fragments_wall_duration_µs attribute to the
pg_txn.replication_client.transaction_received span, set on the commit
fragment. It measures the wall-clock time from a transaction's begin to
its commit as received from Postgres.

Because the replication stream is consumed on demand (e.g. paused while
database connections are scaled down), this includes idle gaps between
fragments and can be far larger than the per-fragment processing time —
it's the signal for transactions whose fragments span a shape consumer's
suspend threshold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ll-time

tx_started_at_mono is set together with txn_fragment on Begin, so a
commit always has it; a begin-less commit would raise on the
`%{fragment | commit}` map-update regardless. Compute the duration
inline at commit and read it directly in the ShapeLogCollector,
removing the defensive nil branches and the misleading "nil after
reconnect" comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gment processing

Carry the begin monotonic time on the commit and compute the wall-clock duration in the ShapeLogCollector after the commit fragment is processed, so processing time is included. Mirrors the existing receive_lag pattern (stored mono time + delta computed later, reported in ms).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…xtures

Changes.Commit is only built by MessageConverter, which always sets tx_started_at on Begin (a begin-less commit raises on the fragment map-update before exiting), so in regular execution the field is always present. Drop the is_integer guard and stamp tx_started_at on synthetic test commits instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lti-shape txn

Drives a single complete transaction that the EventRouter reslices to two shapes, and asserts the total_processing_time span attribute lands exactly once (on the original incoming commit), guarding against future EventRouter changes that might set it per-shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The local transaction/3 helper built a bare %Changes.Commit{}, leaving
tx_started_at nil. ShapeLogCollector computes total_processing_time as
now - tx_started_at on every commit fragment, which raised ArithmeticError
on nil and crashed the collector, failing the restore-latest-offset test.

In production a commit always carries tx_started_at (MessageConverter stamps
it on Begin and copies it onto Commit), so the fix is to honor that invariant
in the test helper, matching Support.TestUtils.txn_fragment/4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alco alco force-pushed the alco/txn-wall-time-metric branch from 67243d1 to 6f54a60 Compare June 8, 2026 14:18
alco and others added 3 commits June 8, 2026 16:27
…ragment/3

Both shape_cache_test and shape_log_collector_test carried their own copy of
a local transaction/3 helper that built a complete single-fragment transaction.
The two copies had already drifted: one stamped tx_started_at on the commit,
the other didn't — which is what produced the ArithmeticError in
shape_cache_test once ShapeLogCollector started computing total_processing_time.

Replace both with the existing shared Support.TestUtils.complete_txn_fragment/3
(already used throughout consumer_test). To support the callers that build an
empty transaction, txn_fragment/4 now derives last_log_offset from the lsn when
there are no changes instead of pattern-matching a non-empty list.

This removes the duplication that allowed the helpers to fall out of sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Assert that the pg_txn.replication_client.transaction_received span carries
the new total_processing_time attribute end-to-end through the OTLP exporter.

This span is the only one gated by ELECTRIC_OTEL_SAMPLING_RATIO (1% by
default), so the setup env now forces full sampling (1.0) to make the span
deterministically exported; otherwise the assertion would almost never match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alco alco marked this pull request as ready for review June 8, 2026 15:12
@alco alco merged commit c015163 into main Jun 9, 2026
75 of 77 checks passed
@alco alco deleted the alco/txn-wall-time-metric branch June 9, 2026 09:48
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

This PR has been released! 🚀

The following packages include changes from this PR:

  • @core/sync-service@1.6.10

Thanks for contributing to Electric!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants