Skip to content

persist-committer: in-envd consensus committer service#36762

Draft
antiguru wants to merge 53 commits into
MaterializeInc:mainfrom
antiguru:persist-committer-design
Draft

persist-committer: in-envd consensus committer service#36762
antiguru wants to merge 53 commits into
MaterializeInc:mainfrom
antiguru:persist-committer-design

Conversation

@antiguru
Copy link
Copy Markdown
Member

Introduces a centralized persist committer service hosted by environmentd.

Each environment funnels all persist consensus traffic through a single CockroachDB connection pool instead of every clusterd opening its own.
A new mz-persist-committer crate exposes the Consensus trait over gRPC, backed by a monotonic in-memory ShardCache.
Clusterds and envd itself implement Consensus via a new RpcConsensus that targets the committer.
Routing is gated behind the persist_consensus_use_committer LD flag (default off).

Design doc: doc/developer/design/20260527_persist_committer.md.
Plan: docs/superpowers/plans/2026-05-27-persist-committer.md.

Status: draft.
Open before merge: cross-crate integration test, full metric instrumentation, validation that set_committer_channel ordering versus open_consensus is safe under all production startup paths, and a security review of accepting arbitrary persist consensus traffic over the committer socket.

@antiguru antiguru force-pushed the persist-committer-design branch 3 times, most recently from 3e2222a to c61e13b Compare May 29, 2026 02:35
antiguru and others added 27 commits May 30, 2026 07:21
Proposes an envd-hosted persist committer that funnels all consensus
traffic from clusterds through a single CRDB connection pool, replacing
the current per-process pool. Includes monotonic-cache rules for the
multi-writer (failover/rollout) case, rollout flags, and metrics plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* clusterd discovers committer via --persist-committer-url clap arg
* committer listens on dedicated new port (default 6878)
* in-process transport uses tonic in-memory channel for envd self-use

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skeleton crate with empty submodules. Functionality lands in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the proto definition for Head/Scan/CompareAndSet/Truncate/Subscribe
and wires the prost+tonic build script for the crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ShardCache: BTreeMap-backed per-shard state with monotonic insert and
LRU eviction; subscribers pin entries. SubscriberRegistry: per-shard
bounded broadcast over tokio channels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The real Consensus::compare_and_set takes no `expected` argument (it is
implicit in new.seqno - 1) and CaSResult does not carry current state on
mismatch. Drop the expected/mismatch-current proto fields, add ListKeys
(admin path), and make truncate's deleted count optional. Design doc now
specifies a background head() refresh on ExpectationMismatch instead of
the impossible cache-from-CaS-response update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PersistCommitter wraps a backing Consensus, serves Head from the cache on
hit and populates it on miss, forwards Scan/Truncate unchanged, and on
CaS Committed updates the cache plus broadcasts to subscribers. On CaS
ExpectationMismatch the committer fires a background head() to refresh
the cache since the underlying trait does not return current state on
mismatch. list_keys collects the underlying stream into a vec for the
gRPC reply. Adds Debug derive to SubscriberRegistry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PersistCommitter now implements ProtoPersistConsensus, marshalling
Head/Scan/CompareAndSet/Truncate/ListKeys plus the Subscribe
server-streaming RPC over a tokio broadcast bridge. Adds async-stream and
tokio-stream deps. Subscribe sends the current cached snapshot first,
then forwards subsequent diffs; lagged subscribers re-sync via Head.

The refresh task periodically re-reads every subscribed shard from the
backing Consensus, monotonic-merges into the cache, and broadcasts a
diff if the seqno advanced. Defends against silent writes from another
writer during failover or rollout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CommitterMetrics holds the standard cache_hits/misses, rpc counters and
latency histograms, plus inflight + cached-shards gauges. Wired through
PersistCommitter; head_inner records cache hits/misses. Counter and
gauge instrumentation in the gRPC handlers themselves lands later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RpcConsensus wraps a tonic Channel and implements the Consensus trait
over the ProtoPersistConsensus gRPC service: list_keys collects the
server's response into a small futures::Stream, head/scan/compare_and_set/
truncate marshal proto requests, and CaSResult is reconstituted from
the bool reply field (the committer's gRPC response intentionally does
not carry current state on mismatch).

To break a would-be circular dep, mz-persist-committer drops its (unused)
mz-persist-client dependency and mz-persist-client now depends on
mz-persist-committer for the generated proto types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PersistClientCache gains an optional committer Channel set via a
set_committer_channel helper. When persist_consensus_use_committer is
true and a channel is present, open_consensus returns an RpcConsensus
wrapped in MetricsConsensus + Tasked just like the Postgres path; the
flag default is false so existing callers see no behavior change. Adds
the four committer LD flags to the dyncfg registry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the optional URL is supplied (CLI or PERSIST_COMMITTER_URL env), the
clusterd builds a lazy tonic channel and registers it on the
PersistClientCache. The persist_consensus_use_committer LD flag controls
whether PersistClientCache actually routes through it; without the flag
the channel is ignored.

Orchestrator-side argv plumbing is intentionally deferred: mzcompose and
local runs can set the URL via env var.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds mz_persist_committer::start_committer that bundles cache + registry
+ refresh + gRPC server behind one entry point and returns a lazy
loopback Channel suitable for set_committer_channel. envd opens its own
PostgresConsensus via ConsensusConfig::try_from + open, hands it to
start_committer, then registers the loopback channel on the
PersistClientCache.

The committer always runs; whether envd actually routes its own traffic
through it is controlled by the persist_consensus_use_committer LD flag
(default off). The listen address is configurable via the new
--internal-persist-committer-listen-addr flag (default 127.0.0.1:6878).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* New test/persist-committer/ mzcompose composition runs Materialize with
  persist_consensus_use_committer enabled, exercises CaS via SQL, and
  asserts that shard state survives an envd kill+restart.
* New PersistCommitter platform check verifies the same shard survives
  the platform-checks restart/upgrade matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reformatted the proto with buf and the Python check with black/ruff.
Pulls in a Cargo.lock update that was stale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ControllerConfig.persist_committer_url is a required String matching
  the persist_pubsub_url pattern; defaults to http://localhost:6880
  in envd's clap arg.
* Process orchestrator (controller/clusters.rs) appends
  --persist-committer-url=<url> to clusterd argv unconditionally.
* Kubernetes orchestratord creates a per-generation headless Service
  named `mz<id>-persist-committer-<generation>` selecting the envd
  statefulset, on the new --environmentd-internal-persist-committer-port
  (default 6880). envd's argv now includes both --persist-committer-url
  pointing at that service and --internal-persist-committer-listen-addr.
* Committer port moved from 6878 (collides with internal_http_port) to
  6880.
* Smoke composition wired into ci/test/pipeline.template.yml under
  build-aarch64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Use mz_ore::test(tokio::test) instead of #[tokio::test].
* Use Arc::clone(&x) instead of x.clone() on ref-counted pointers.
* Replace `as` casts with try_from in tests; use usize::from for bool->int.
* Allow clippy::clone_on_ref_ptr and as_conversions on the prost/tonic
  generated proto module.
* Rephrase ProtoTruncateResponse.deleted doc comment to avoid generating
  an unclosed-HTML-tag rustdoc warning.
* mzcompose.py workflow_default now loops over workflows with test_case.
* Register the four new persist_committer LD flags in
  parallel_workload's FlipFlagsAction and in
  mzcompose UNINTERESTING_SYSTEM_PARAMETERS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For staging exercise only. Revert before merging the PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6880 collides with the SQL password listener defined in
src/materialized/ci/listener_configs/testdrive.json (and friends).
Move the committer to 6882 to sit outside the 6875-6881 range already
used by listeners and testdrive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Routing envd's own persist traffic through tonic on the same runtime as
the committer's gRPC server deadlocks the runtime under load (observed in
pgwire and sqllogictest suites). Mirror the pubsub
new_same_process_connection pattern: the committer exposes an
InProcessConsensus that dispatches directly into PersistCommitter::*_inner,
and start_committer returns it on CommitterHandle alongside the existing
loopback Channel.

envd now configures the PersistClientCache with the in-process consensus
via the new set_committer_in_process helper, which takes precedence over
any committer Channel. Clusterds still use RpcConsensus over gRPC. Tests
that exercise the wire path can still reach the loopback Channel.

Also fixes the smoke.td DROP order (materialized view before its table).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Test harnesses (mz-environmentd integration tests and sqllogictest)
construct ControllerConfig directly without starting a persist committer.
Passing a non-empty URL like http://localhost:6882 made spawned clusterds
configure RpcConsensus pointing at a non-existent listener, hanging
forever once the LD flag enabled the committer path.

Make ControllerConfig.persist_committer_url an Option<String>. The
process orchestrator only emits --persist-committer-url when Some.
environmentd's main wraps its always-present clap string in Some; test
harnesses pass None.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep the startup banner at info level. CaS-per-op logs would flood at
normal verbosity under any real workload; move them to debug so the
default MZ_LOG_FILTER doesn't drown in them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Update the in-process transport open question to reflect what we
  actually shipped (direct trait dispatch via InProcessConsensus) and
  why (loopback gRPC on the same runtime deadlocked tests).
* Add "Out of scope": the timestamp oracle retains its own direct
  CockroachDB connection and is a separate wart.
* Add "Future architecture: extract persistd as a separate service"
  motivating moving the committer (and eventually pubsub, compaction)
  out of environmentd's process. The InProcessConsensus adapter is
  flagged as a transient workaround that disappears in that future.
* Add "Future direction: protocol consolidation with pubsub" capturing
  the eventual unified PersistRouter surface.
* Add "Transport choice" calling out gRPC as a v1 expedient given both
  ends are always on the same revision; lists leaner alternatives to
  benchmark against.
* Add "Future direction: single-writer batched CaS" capturing the
  pipelined-batch RPC, multi-writer fast-fail, and the long-term
  apply-API direction motivated by the single-writer-per-shard contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Operation errors are now data on the response rather than tonic Status.
Each unary response (Head, Scan, CompareAndSet, Truncate, ListKeys) is a
oneof { ok, err } where err carries a ProtoDeterminacy classifier. tonic
Status is reserved for real transport failures, which the client treats
as Indeterminate.

Before this change, the server flattened both arms of ExternalError into
Status::internal(...) and the client mapped any tonic Status back to
Indeterminate. Determinate errors from the underlying Postgres consensus
were therefore retried indefinitely as if indeterminate, blocking
controller progress in nightly tests (wallclock-lag, introspection
sources, unified compute introspection, subscribe timeouts).

Unknown / unspecified determinacy on the client defaults to
Indeterminate so forward-compat with older servers remains retry-safe.
…refresh_interval

ALTER SYSTEM SET expects SQL string literals for duration-typed flags;
the unquoted form 5s parses as identifier and the statement fails. Match
the existing quoting convention used by other duration flags in this
file.
The orchestrator unconditionally added --persist-committer-url and
--internal-persist-committer-listen-addr to the envd argv and created
a headless committer Service. When the deployed envd image predates
the committer flags, clap rejects the unknown arguments and the
container ends in CrashLoopBackOff.

Change the orchestrator config's committer port to Option<u16> with
no default. When set, orchestratord creates the Service and emits the
two argv flags as before. When unset, neither resource is produced,
so older envd images boot unchanged. Helm chart values can opt in
once the deployed envd is known to understand the flags.

Service teardown remains unconditional because delete_resource is
idempotent on 404s, which lets a Some -> None reconfiguration clean
up a previously-created Service.
The committer's eager open of the metadata backend connection panics
envd if CRDB / postgres is not yet accepting on TCP at startup. Older
envd images absorbed this race inside persist's own retry logic; the
committer wiring puts the open on a critical-path call that gets no
retries.

Wrap the open in mz_ore::retry::Retry with a 60 s max duration and a
2 s backoff cap so envd boots through transient backend availability
gaps without losing the eventual fatal-error behavior on a permanent
misconfiguration.
antiguru and others added 26 commits May 30, 2026 07:21
…vice

The clusterd inherits --persist-committer-url from the envd default,
which uses localhost. Inside the materialized container under the
process orchestrator the loopback name resolution differs from the
explicit 127.0.0.1 the pubsub override uses, producing a mismatch
between the address envd binds and the address clusterds dial. Set
PERSIST_COMMITTER_URL alongside the existing pubsub overrides so the
two stay consistent.
Wires the previously-declared rpc_total, rpc_duration_seconds, and
inflight_rpcs metrics into the tonic handlers via an RAII RpcGuard.
Adds backing_duration_seconds (time in the underlying Consensus call
only), inflight_rpcs_by_op, and cas_refresh_lag_seconds. Diffing
rpc_duration_seconds against backing_duration_seconds isolates
committer-internal cost from backing-store cost; inflight_rpcs_by_op
plus a sudden rise in backing_duration_seconds points at PG pool
exhaustion; cas_refresh_lag_seconds reveals whether the
post-mismatch background head() is losing races against the client's
next CaS retry.

Spawns a periodic stats heartbeat task that logs one INFO line per
configured interval summarizing per-op counts, mean latencies, cache
hit rate, CaS outcome breakdown, in-flight gauges, and refresh lag.
Default 30s; controlled by the new
persist_committer_stats_heartbeat_interval dyncfg (0s disables).
Useful when Prometheus is not available, e.g. CI nightly logs.
Both directions hit the default cap immediately: a scan response of
4.7 MiB triggered "decoded message length too large: found 4737597
bytes, the limit is: 4194304 bytes" and persist's normal retry loop
spun forever because every retry produced the same oversized payload.

Persist consensus state grows past 4 MiB on hot shards by design, so
the only correct fix is to disable the limits. Matches the precedent
in mz_persist_client::rpc, which uses usize::MAX for pubsub for the
same reason.
Adds three log lines for parity with the pubsub diagnostics that
operators rely on when reading raw envd / clusterd logs in CI:

  Connecting to Persist Committer: <url>
  Connected to Persist Committer: <url>
  Received Persist Committer connection from: <addr>

The first fires at clusterd startup before the lazy gRPC channel is
constructed. The second latches once per RpcConsensus after the
first successful round trip. The third fires once per remote peer on
the server side, gated by a BTreeSet on the committer.

set_committer_channel now carries the URL alongside the Channel so
RpcConsensus can render it in the connected line. The URL is used
purely for diagnostics; routing still happens entirely over the
existing channel.
subscribe_inner read the snapshot first and registered the broadcast
receiver second. Any CaS that committed during that window published
to a registry with no listener, and the new subscriber's first event
was a later diff while the just-committed value was silently dropped.

Swap the order: register first, then read the snapshot. Diffs that
commit during the snapshot read are now buffered in the broadcast
channel and delivered to the receiver. Callers already need to dedup
by seqno because the snapshot can include diffs that are then
redelivered, so the new worst case is a duplicate event, never a
missed one.
Adds a mz-persistd binary crate that runs the same persist committer
logic environmentd embeds, but in its own process. Clusterds (and any
co-deployed environmentd) point --persist-committer-url at it.

Includes:

  * src/persistd: binary crate, mzbuild image, Dockerfile
  * misc/python/materialize/mzcompose/services/persistd.py: Persistd
    Service helper for compositions
  * test/persistd/mzcompose.py: smoke composition that brings up
    persistd alongside materialized, points MZ_PERSIST_COMMITTER_URL
    at the persistd container, and exercises a small testdrive flow
    plus restart survival
  * ci/test/pipeline.template.yml: nightly step for the new
    composition

The standalone committer is the v1 step toward extracting persist out
of envd entirely. environmentd still starts its own in-process
committer; pointing MZ_PERSIST_COMMITTER_URL externally only affects
clusterds in this configuration. A follow-up will let envd skip the
in-process committer when configured to use an external persistd.
Adds an --external-persist-committer / MZ_EXTERNAL_PERSIST_COMMITTER
flag. When set, environmentd does not start its own in-process
committer; instead it dials --persist-committer-url as a gRPC client,
the same path clusterds use. Pairs with the new persistd binary so a
single standalone committer serves both envd and clusterd consensus
traffic.

In-process remains the default. The persist_consensus_use_committer
LD flag still gates whether any particular consensus open actually
routes through the committer, so this flag only changes where the
committer lives, not whether it's used.

test/persistd/mzcompose.py now sets MZ_EXTERNAL_PERSIST_COMMITTER=true
so the smoke composition exercises the full external-persistd path.
mzbuild's Cargo.toml parser counts both the explicit [[bin]] entry and
the autobins discovery of src/bin/persistd.rs, producing 'bin persistd
appears more than once in cargo workspace'. Autobins alone suffices
since the file name matches the desired binary name.
Adds a generic Service.companions list and extends the Composition
SERVICES loop to register companions alongside their parent. Uses the
mechanism in Materialized to auto-attach a paired Persistd container
whenever an external Postgres-flavored metadata store is in use.

Default behavior: every Materialized in every composition gains a
sibling <materialized-name>-persistd container that owns the persist
consensus traffic for that envd. envd dials it as a client over gRPC,
the same path clusterds use, via MZ_EXTERNAL_PERSIST_COMMITTER=true.
Survives envd restarts and container swaps, which fixes the loopback
wedge legacy-upgrade hit (clusterds left dialing a dead envd's
127.0.0.1:6882 across a Materialized container swap).

Opt out per-test with `Materialized(external_persist_committer=False)`
when topology assertions or specialized wiring demand it. The
dedicated test/persistd composition disables auto-attach to keep its
explicit named persistd container as the only one in play.
The standalone persistd composition referenced postgres-metadata in
Persistd.depends_on but did not declare the service, so docker compose
rejected the project with 'undefined service postgres-metadata'.

Use metadata_store_services() to add the right metadata store
(postgres-metadata / cockroach / etc. per env) and route persistd's
--consensus-url at the same hostname instead of the hardcoded default.
…ache

bin/bump-version now updates persistd's Cargo.toml version alongside
other binary crates. Source comment touched to force CI to re-resolve
the persistd image hash; the previous run reportedly didn't produce
the persistd binary despite mkpipeline listing it.
`tonic::transport::Endpoint::connect_lazy` builds the hyper client, which
immediately calls `tokio::spawn` to drive its connection pool. Outside a
runtime context, that panics with `there is no reactor running, must be
called from the context of a Tokio 1.x runtime`. envd's
`--external-persist-committer` path hit this because the wiring sits in
`fn run(args) -> Result<(), anyhow::Error>` ahead of the first async
hop, so the runtime was constructed but no task was on it yet.

Wrap the `connect_lazy()` call in `runtime.block_on(async { ... })` so
the implicit `tokio::spawn` lands on the right runtime.
`Composition.override()` rewrote the service entry for each target but
never walked its `companions` list. When a `Materialized` auto-attached
a paired `Persistd`, the override path landed an envd entry whose
`depends_on` referenced an undefined `materialized-persistd` service,
and docker compose rejected the project with:

  service "materialized" depends on undefined service
  "materialized-persistd": invalid compose project

The static SERVICES path already recursed into companions; bring
`override()` in line. Companions are exempt from the
`fail_on_new_service` check because they piggyback on the parent
service rather than appearing in the composition's SERVICES list.
The wholesale `self.compose = old_compose` restore in the `finally`
block already handles cleanup.
The Subscribe path was advertised as the cache-freshness mechanism for
multi-writer scenarios, but no client ever opened a Subscribe stream:

  * `RpcConsensus` has no `subscribe` method.
  * `InProcessConsensus` has no `subscribe` method.
  * `refresh::spawn_refresh` only refreshed shards reported by
    `cache.shards_with_subscribers()`, which is always empty, so the
    TTL refresh never executed.

The cache therefore self-corrected solely via the CaS-mismatch
fire-and-forget head() refresh in `cas_inner`. Ship that, drop the
inert machinery:

  * Delete `subscribe.rs` (SubscriberRegistry + broadcast).
  * Delete `refresh.rs` (periodic per-subscribed-shard refresh).
  * Drop the `Subscribe` RPC and `Proto{Subscribe,Subscribe*Message}`
    messages from the proto. Rebuilt protos no longer expose them.
  * Strip the `subscribe_inner` method, broadcast plumbing, and
    `SubscribeStream` type from `server.rs`. The remaining
    `spawn_refresh` is the CaS-mismatch refresh and stays.
  * Strip `subscribers` count, `SubscriberToken`,
    `shards_with_subscribers`, and the subscriber-pinned LRU logic
    from `ShardCache`. Eviction now picks the oldest entry; the LRU
    pin was only meaningful to the periodic refresh.
  * Drop `PERSIST_COMMITTER_CACHE_REFRESH_INTERVAL` dyncfg, the
    matching CLI flag on `persistd`, the `cache_refresh_interval`
    field on `CommitterConfig`, the parallel_workload flag entries,
    and the `UNINTERESTING_SYSTEM_PARAMETERS` reference.
  * Drop now-unused `async-stream` and `tokio-stream` dependencies
    from `mz-persist-committer`.

No behavior change: those code paths were unreachable. The design
doc's "subscriber-driven freshness" claim is now also fiction-free.
* Drop the Subscribe RPC, SubscriberRegistry, and periodic refresh
  task from the Components, Data-flow, Cache-freshness, and Error-
  handling sections to match the now-shipped surface.
* Strike the `persist_committer_cache_refresh_interval` flag and
  add the actual `persist_committer_stats_heartbeat_interval` knob
  the binary respects.
* Update Metrics to the set the code registers (rpc_total,
  rpc_duration_seconds, backing_duration_seconds, cache_hits,
  cache_misses, cached_shards, inflight_rpcs[_by_op],
  cas_refresh_lag_seconds); flag overload protection / queue depth
  as a follow-up rather than implemented.
* Drop the "parameterize the entire persist correctness suite"
  testing claim; describe the unit, mzcompose, platform-check,
  and parallel-workload tests that actually exist.
* Move "standalone persistd binary" out of Out-of-scope (it ships
  here as a binary + image + mzcompose companion auto-attach).
* Document the omitted Subscribe RPC explicitly under Out-of-scope
  so the freshness story is honest about relying on CaS-mismatch
  refresh.
* Rewrite "Future architecture: extract persistd" into "Standalone
  persistd" describing what landed (binary, env-flag, companion)
  and what is still TODO (orchestratord, helm, in-envd removal).
* Drop the docs/superpowers/plans/ scratch plan from the PR.

No code changes.
… on healthy

Three diagnosable / preventable failures observed in feature-benchmark
and parallel-benchmark:

* The startup INFO line did not include `consensus_url`, so when
  persistd repeatedly logged `waiting for consensus backend:
  indeterminate: error connecting to server`, operators had no way
  to confirm whether the URL passed by the orchestrator matched
  the metadata store they expected. `SensitiveUrl` masks any
  password component, so logging the URL at INFO is safe.

* The startup retry was bounded at 60s. When the metadata store
  exposes `service_started` (process alive) but not yet
  `service_healthy` (accepting connections) for longer than that,
  persistd exhausted its retries and exited rather than waiting,
  which races compose's own healthcheck. Drop the cap and let the
  loop run indefinitely; compose's healthcheck timeout governs.

* The `Materialized`-auto-attached `Persistd` companion wired its
  metadata-store `depends_on` as `service_started`, while envd
  itself depends on the same store with `service_healthy`. That
  left persistd guaranteed to outrun the backend on cold start in
  any composition where the backend takes more than a few seconds
  to begin accepting connections. Mirror envd's `service_healthy`
  condition.

Also log `consensus backend ready` once `open()` succeeds, so
that a long startup followed by a quiet steady state is
distinguishable from a hung retry loop.
…chema

Two robustness fixes prompted by a deployment where the consensus backing
store was reset under a running persistd: every consensus op then failed
forever with an opaque "db error" and the deployment stalled silently.

Diagnostics (shared by persistd and the in-envd committer):
* Log a failed backing-consensus op at warn with its full cause chain, so
  the backend's real error (e.g. a Postgres SQLSTATE) is visible. The
  per-op debug traces are off by default, so failures were silent.
* Send the full cause chain over gRPC instead of the top-level Display.
  For a Postgres-backed error the top level is the opaque "db error"; the
  SQLSTATE and message live in the source chain, which clients (and
  operators reading their logs) otherwise never saw.

Recovery (standalone persistd only):
* Add ExternalError::is_undefined_table, detecting SQLSTATE 42P01.
* Wrap persistd's consensus in TerminateOnMissingSchema: if the consensus
  relation vanishes (the backing store was reset), persistd exits non-zero
  so its supervisor restarts it and PostgresConsensus::open recreates the
  schema, instead of spinning forever. The policy is intentionally local
  to the standalone binary; the in-process committer in environmentd shares
  the same committer code but must never self-terminate.

Tests: a committer unit test asserts the gRPC error preserves the cause
chain; the env-gated postgres consensus test asserts is_undefined_table
fires after the consensus table is dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Workspace-wide cargo feature unification compiles mz-persist with the
foundationdb feature (enabled by materialized), so the persistd binary
carries a DT_NEEDED libfdb_c.so entry even though persistd does not enable
the feature itself. The distroless base lacks the library, so the process
failed at load with "libfdb_c.so: cannot open shared object file".

Copy the FoundationDB client library into the image, mirroring ubuntu-base
which does the same for materialized.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…exit

Two follow-ups from running the committer against a metadata-store reset.

* time_backing logged every failed backing op at warn, including CRDB
  serialization conflicts (determinate, retried by the client). That floods
  the log with hundreds of routine "could not serialize access" lines. Only
  warn on indeterminate failures; the 42P01 schema-loss error stays visible.

* The standalone persistd exits non-zero when its consensus schema vanishes
  so it can be restarted and recreate the schema, but the mzcompose Persistd
  service had no restart policy. The container stayed down and its DNS name
  stopped resolving, leaving materialized stuck on "connection refused" /
  "dns error". Give persistd a bounded on-failure restart policy, matching
  cockroach and postgres. Restart is the orchestrator's job, not an
  entrypoint loop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Each scenario kills and removes cockroach, materialized, clusterd, and
testdrive and wipes the mzdata volume, but left the auto-attached
materialized-persistd companion running. The orphaned persistd then faced
the next scenario's fresh, empty cockroach: its consensus schema was gone,
so it exited and the next materialized wedged unhealthy ("Found orphan
containers", then repeated retries).

Tear persistd down together with the metadata store it fronts. The
companion is only a declared service inside the mz override, so the kill/rm
run within `c.override(mz, clusterd)`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The distroless base lacked shared libraries the persistd binary needs.
Feature unification compiles mz-persist with the same heavy dependency set
as environmentd (FoundationDB, plus xz/lzma via arrow/parquet), so persistd
needs libfdb_c.so, liblzma.so.5, and friends at load time. Bundling each by
hand was whack-a-mole (libfdb_c.so, then liblzma.so.5, ...).

Switch from distroless-prod-base to prod-base, which provides the full
Ubuntu runtime via ubuntu-base, matching environmentd and clusterd. Drops
the manual libfdb_c.so copy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The terminate-on-missing-schema path returned an error that bubbled up to a
panic! in main. A panic prints a backtrace (slow, and RUST_BACKTRACE=1 is set
in the image) and is counted as a failure by the test harness even though the
exit is intentional.

Log the cause and std::process::exit(1) directly from the schema-lost branch.
The supervisor still restarts persistd, and open() recreates the schema, but
without the panic and backtrace.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When the backing store is reset out from under a running persistd (test
harnesses wipe the metadata store between phases; a fresh store is
provisioned), the consensus relation disappears and every op fails with
SQLSTATE 42P01. The previous behavior exited the process for a supervised
restart, but harnesses strand the persistd companion faster than restart +
health-wait can recover, cascading into "materialized is unhealthy" failures
(pg-cdc / mysql-cdc source-versioning migration, feature-benchmark).

Recover in place instead: SelfHealingConsensus wraps the backend and, on an
undefined-table error, re-opens consensus (re-running the idempotent CREATE
SCHEMA/TABLE IF NOT EXISTS DDL) and retries. The reopened handle is
discarded; the original pool finds the recreated table on retry. This is
local to the standalone binary — the in-process committer in environmentd
must never silently recreate a missing schema, which would mask data loss.

This removes the need for the terminate-and-restart scaffolding, so revert
the persistd restart policy and the feature-benchmark per-scenario teardown.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
In k8s the persist committer runs inside environmentd; clusterds in their own
pods must reach it over the network. The committer Service and the
--persist-committer-url / --internal-persist-committer-listen-addr args
existed but were gated behind an opt-in port that defaulted to unset, so envd
fell back to its loopback committer URL and threaded that to clusterds. They
got localhost:6882 (their own empty pod), every persist read was refused, and
0dt rollouts never reached ReadyToPromote.

Mirror the pubsub wiring: default the committer port (6882) and pass the
routable per-generation headless Service DNS as the committer URL, so envd
listens on 0.0.0.0 and clusterds dial the Service. Gate the args and Service
on meets_minimum_version(V26_27_0_DEV0) — the version that introduced the
committer flags — so an upgrade from an older envd image (which would exit on
the unknown flags) stays safe. Replaces the opt-in Option<port> with a
version check, matching how other envd args are gated across upgrades.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rebase onto main moved the workspace to 26.28.0-dev.0. Bump persistd's
hardcoded version to match, and move the committer version gate from
26.27.0-dev.0 to 26.28.0-dev.0 — the dev version the feature now lands in —
so a 26.27 image (released without the committer flags) is correctly excluded
during upgrades.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@antiguru antiguru force-pushed the persist-committer-design branch from 068471c to 227abac Compare May 30, 2026 11:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant