Skip to content

fix(sync-service): don't suspend a consumer mid-transaction (#4501)#4503

Merged
alco merged 3 commits into
mainfrom
alco/fix-suspend-mid-txn
Jun 9, 2026
Merged

fix(sync-service): don't suspend a consumer mid-transaction (#4501)#4503
alco merged 3 commits into
mainfrom
alco/fix-suspend-mid-txn

Conversation

@erik-the-implementer

@erik-the-implementer erik-the-implementer commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #4501KeyError: key :consider_flushed? not found in: nil in Electric.Shapes.Consumer.process_txn_fragment/2.

A shape consumer could suspend (terminate to reclaim memory) on its idle timeout while still holding a pending_txn for an in-flight multi-fragment transaction. The producer's EventRouter tracks "this shape already saw the begin for the current xid" keyed by shape_handle, independently of consumer liveness — so when a later fragment of that transaction arrived, ConsumerRegistry started a fresh consumer and delivered a has_begin?: false fragment to it. The fresh consumer has pending_txn: nil, so process_txn_fragment/2 dereferenced nil at txn.consider_flushed?.

Fix

consumer_can_suspend?/1 now also requires is_nil(state.pending_txn). A consumer that is mid-transaction hibernates instead of suspending, and only suspends once the transaction completes and it goes idle again.

🤖 Generated with Claude Code

@codecov

codecov Bot commented Jun 4, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.45%. Comparing base (7892079) to head (16b7617).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4503       +/-   ##
===========================================
+ Coverage   32.48%   56.45%   +23.97%     
===========================================
  Files         216      358      +142     
  Lines       18368    39081    +20713     
  Branches     6478    10976     +4498     
===========================================
+ Hits         5967    22064    +16097     
- Misses      12369    16946     +4577     
- Partials       32       71       +39     
Flag Coverage Δ
packages/agents 70.75% <ø> (?)
packages/agents-mcp 77.54% <ø> (?)
packages/agents-mobile 66.92% <ø> (ø)
packages/agents-runtime 79.99% <ø> (?)
packages/agents-server 73.95% <ø> (+0.04%) ⬆️
packages/agents-server-ui 6.21% <ø> (ø)
packages/electric-ax 46.42% <ø> (?)
packages/experimental 87.73% <ø> (?)
packages/react-hooks 86.48% <ø> (?)
packages/start 82.83% <ø> (?)
packages/typescript-client 91.83% <ø> (?)
packages/y-electric 56.05% <ø> (?)
typescript 56.45% <ø> (+23.97%) ⬆️
unit-tests 56.45% <ø> (+23.97%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

This PR fixes the production crash KeyError: key :consider_flushed? not found in: nil in Electric.Shapes.Consumer.process_txn_fragment/2 (#4501) by refusing to suspend a consumer that still holds a pending_txn for an in-flight multi-fragment transaction. The fix is minimal, correctly placed, and well tested. Since iteration 1 the changeset entry was trimmed and the test's assert_receive/refute_receive timings were tuned — the functional change is unchanged and still correct.

What's Working Well

  • The guard is in exactly the right place. consumer_can_suspend?/1 now also requires is_nil(state.pending_txn) (consumer.ex:418-420). pending_txn is the canonical "mid-transaction" indicator — set on any begin-without-commit (consumer.ex:522-529) and cleared on every completion path (consumer.ex:586,602,751,760) — so it transparently protects both the streaming path and the write_unit=txn buffering path.
  • Hibernate-instead-of-suspend trade-off is sound. Falling through to :hibernate keeps the process alive with a compacted heap; the next handle_event re-arms the idle timer, so the consumer suspends naturally once the txn completes. No risk of a permanently un-reclaimable process.
  • Good test-timing tweak. Widening refute_receive ... :suspend from 100ms to 400ms (consumer_test.exs:1501) strengthens the negative assertion: with hibernate_after: 10, a regressed (suspend-prone) build would terminate well within 400ms, so the wider window makes a false pass less likely. Dropping the explicit 1_000 on the final suspend assert_receive (consumer_test.exs:1529) is fine — it runs only after assert_receive {:flush_boundary_updated, 300}, 1_000 has already drained the work, so the consumer is idle and suspends within the default window.
  • Changeset present, correctly scoped (@core/sync-service, patch), and now appropriately concise.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

Defense in depth: the nil.consider_flushed? crash is still reachable on any other mid-transaction consumer death (carried over from iteration 1 — now confirmed reachable).

File: packages/sync-service/lib/electric/shapes/consumer.ex:562-568

In iteration 1 I flagged this but left open whether a consumer crash re-emits the begin (which would make it unreachable). I have now traced it, and it does not:

  • EventRouter.shapes_seen_begin is reset only on transaction begin (event_router.ex:114-126) or end (event_router.ex:128-144), and per-shape only via an explicit remove_shape (event_router.ex:50-59). A consumer crash + supervised restart triggers none of these — the shape stays in shapes_seen_begin.
  • The collector explicitly treats "a consumer processes fragment 1 but crashes on fragment 2" as a real, handled scenario for flush tracking (shape_log_collector.ex:601-607), confirming mid-transaction consumer death is expected.
  • A restarted consumer comes back with pending_txn: nil (consumer/state.ex default, never recovered on init). The next fragment for that xid carries has_begin?: false, so process_txn_fragment/2 matches %State{pending_txn: txn} with txn = nil and dereferences nil at txn.consider_flushed? (consumer.ex:568). Since the commit fragment is also has_begin?: false, the restarted consumer crashes on every remaining fragment until the txn completes — a crash loop.

This PR correctly closes the most common trigger (idle-timeout suspend), and that is a legitimate, well-targeted fix for #4501. But the underlying fragility is pre-existing and remains for any non-suspend mid-txn death (transient storage error, supervisor restart, etc.). It is reasonable to keep this PR narrow and address the class separately, but it should not be lost — consider a follow-up that adds a defensive process_txn_fragment/2 head matching %State{pending_txn: nil} with a non-begin fragment that logs and skips/recovers instead of crashing the supervision tree.

Suggestions (Nice to Have)

  • The test exercises the default (streaming) write_unit. Since the guard also protects the write_unit=txn buffering path, a sibling assertion or a one-line note documenting that broader intent would be nice. Optional.

Issue Conformance

Directly addresses #4501. The linked issue is a Sentry crash report with a clear stacktrace at consumer.ex:567/196; the fix targets exactly that dereference path. No scope creep. Issue is adequately specified for a crash fix.

Previous Review Status

Iteration 1 → 2:

  • ✅ Changeset entry trimmed (9361605).
  • ✅ Test timings tweaked (16b7617) — a net improvement (see above).
  • ⏳ The defense-in-depth concern from iteration 1 is unaddressed and now confirmed reachable (details above). Acceptable to defer to a follow-up given the narrow scope of this fix, but worth tracking.

The functional fix itself is unchanged and remains correct — this is a ✅ to merge for resolving #4501, with the defense-in-depth item recommended as a follow-up.


Review iteration: 2 | 2026-06-08

alco and others added 3 commits June 8, 2026 15:58
A consumer could suspend on its idle timeout while holding a pending_txn
for an in-flight multi-fragment transaction, dropping that state. When a
later fragment of the transaction arrived, a fresh consumer received a
has_begin?: false fragment with pending_txn=nil and crashed in
process_txn_fragment/2 (KeyError on :consider_flushed?).

Guard consumer_can_suspend?/1 on is_nil(pending_txn) so the consumer
hibernates instead and suspends only once the transaction completes.

Fixes #4501

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alco alco force-pushed the alco/fix-suspend-mid-txn branch from e5903b3 to 16b7617 Compare June 8, 2026 14:03
@alco alco marked this pull request as ready for review June 8, 2026 14:04
@alco alco merged commit 9dfa0eb into main Jun 9, 2026
69 of 71 checks passed
@alco alco deleted the alco/fix-suspend-mid-txn branch June 9, 2026 09:48
alco added a commit that referenced this pull request Jun 9, 2026
## Summary

Adds a `total_processing_time` attribute to the
`pg_txn.replication_client.transaction_received` span, set on the commit
fragment. It records the **wall-clock time taken to process all
fragments of a single transaction** — from when the begin was received
to when the commit fragment finishes processing.

Today our spans only measure per-fragment *processing* time (~ms). They
can't tell us how long a transaction's fragments are smeared across in
wall-clock terms — which is the quantity that determines whether a shape
consumer can idle past its suspend threshold mid-transaction (see #4501
/ #4503).

Unlike `receive_lag` — which is anchored on the Postgres commit
timestamp and measures end-to-end delivery lag, from when Postgres
committed the transaction to when Electric finished processing it —
`total_processing_time` is anchored entirely within Electric: it spans
receipt of the begin fragment to completion of the commit fragment.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Oleksii Sholik <oleksii@sholik.dev>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KeyError: key :consider_flushed? not found in: nil

3 participants