persist-committer: in-envd consensus committer service by antiguru · Pull Request #36762 · MaterializeInc/materialize

antiguru · 2026-05-28T01:30:10Z

Introduces a centralized persist committer service hosted by environmentd.

Each environment funnels all persist consensus traffic through a single CockroachDB connection pool instead of every clusterd opening its own.
A new mz-persist-committer crate exposes the Consensus trait over gRPC, backed by a monotonic in-memory ShardCache.
Clusterds and envd itself implement Consensus via a new RpcConsensus that targets the committer.
Routing is gated behind the persist_consensus_use_committer LD flag (default off).

Design doc: doc/developer/design/20260527_persist_committer.md.
Plan: docs/superpowers/plans/2026-05-27-persist-committer.md.

Status: draft.
Open before merge: cross-crate integration test, full metric instrumentation, validation that set_committer_channel ordering versus open_consensus is safe under all production startup paths, and a security review of accepting arbitrary persist consensus traffic over the committer socket.

Proposes an envd-hosted persist committer that funnels all consensus traffic from clusterds through a single CRDB connection pool, replacing the current per-process pool. Includes monotonic-cache rules for the multi-writer (failover/rollout) case, rollout flags, and metrics plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* clusterd discovers committer via --persist-committer-url clap arg * committer listens on dedicated new port (default 6878) * in-process transport uses tonic in-memory channel for envd self-use Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Skeleton crate with empty submodules. Functionality lands in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the proto definition for Head/Scan/CompareAndSet/Truncate/Subscribe and wires the prost+tonic build script for the crate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ShardCache: BTreeMap-backed per-shard state with monotonic insert and LRU eviction; subscribers pin entries. SubscriberRegistry: per-shard bounded broadcast over tokio channels. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The real Consensus::compare_and_set takes no `expected` argument (it is implicit in new.seqno - 1) and CaSResult does not carry current state on mismatch. Drop the expected/mismatch-current proto fields, add ListKeys (admin path), and make truncate's deleted count optional. Design doc now specifies a background head() refresh on ExpectationMismatch instead of the impossible cache-from-CaS-response update. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PersistCommitter wraps a backing Consensus, serves Head from the cache on hit and populates it on miss, forwards Scan/Truncate unchanged, and on CaS Committed updates the cache plus broadcasts to subscribers. On CaS ExpectationMismatch the committer fires a background head() to refresh the cache since the underlying trait does not return current state on mismatch. list_keys collects the underlying stream into a vec for the gRPC reply. Adds Debug derive to SubscriberRegistry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PersistCommitter now implements ProtoPersistConsensus, marshalling Head/Scan/CompareAndSet/Truncate/ListKeys plus the Subscribe server-streaming RPC over a tokio broadcast bridge. Adds async-stream and tokio-stream deps. Subscribe sends the current cached snapshot first, then forwards subsequent diffs; lagged subscribers re-sync via Head. The refresh task periodically re-reads every subscribed shard from the backing Consensus, monotonic-merges into the cache, and broadcasts a diff if the seqno advanced. Defends against silent writes from another writer during failover or rollout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CommitterMetrics holds the standard cache_hits/misses, rpc counters and latency histograms, plus inflight + cached-shards gauges. Wired through PersistCommitter; head_inner records cache hits/misses. Counter and gauge instrumentation in the gRPC handlers themselves lands later. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

RpcConsensus wraps a tonic Channel and implements the Consensus trait over the ProtoPersistConsensus gRPC service: list_keys collects the server's response into a small futures::Stream, head/scan/compare_and_set/ truncate marshal proto requests, and CaSResult is reconstituted from the bool reply field (the committer's gRPC response intentionally does not carry current state on mismatch). To break a would-be circular dep, mz-persist-committer drops its (unused) mz-persist-client dependency and mz-persist-client now depends on mz-persist-committer for the generated proto types. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PersistClientCache gains an optional committer Channel set via a set_committer_channel helper. When persist_consensus_use_committer is true and a channel is present, open_consensus returns an RpcConsensus wrapped in MetricsConsensus + Tasked just like the Postgres path; the flag default is false so existing callers see no behavior change. Adds the four committer LD flags to the dyncfg registry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When the optional URL is supplied (CLI or PERSIST_COMMITTER_URL env), the clusterd builds a lazy tonic channel and registers it on the PersistClientCache. The persist_consensus_use_committer LD flag controls whether PersistClientCache actually routes through it; without the flag the channel is ignored. Orchestrator-side argv plumbing is intentionally deferred: mzcompose and local runs can set the URL via env var. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds mz_persist_committer::start_committer that bundles cache + registry + refresh + gRPC server behind one entry point and returns a lazy loopback Channel suitable for set_committer_channel. envd opens its own PostgresConsensus via ConsensusConfig::try_from + open, hands it to start_committer, then registers the loopback channel on the PersistClientCache. The committer always runs; whether envd actually routes its own traffic through it is controlled by the persist_consensus_use_committer LD flag (default off). The listen address is configurable via the new --internal-persist-committer-listen-addr flag (default 127.0.0.1:6878). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* New test/persist-committer/ mzcompose composition runs Materialize with persist_consensus_use_committer enabled, exercises CaS via SQL, and asserts that shard state survives an envd kill+restart. * New PersistCommitter platform check verifies the same shard survives the platform-checks restart/upgrade matrix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reformatted the proto with buf and the Python check with black/ruff. Pulls in a Cargo.lock update that was stale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ControllerConfig.persist_committer_url is a required String matching the persist_pubsub_url pattern; defaults to http://localhost:6880 in envd's clap arg. * Process orchestrator (controller/clusters.rs) appends --persist-committer-url=<url> to clusterd argv unconditionally. * Kubernetes orchestratord creates a per-generation headless Service named `mz<id>-persist-committer-<generation>` selecting the envd statefulset, on the new --environmentd-internal-persist-committer-port (default 6880). envd's argv now includes both --persist-committer-url pointing at that service and --internal-persist-committer-listen-addr. * Committer port moved from 6878 (collides with internal_http_port) to 6880. * Smoke composition wired into ci/test/pipeline.template.yml under build-aarch64. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Use mz_ore::test(tokio::test) instead of #[tokio::test]. * Use Arc::clone(&x) instead of x.clone() on ref-counted pointers. * Replace `as` casts with try_from in tests; use usize::from for bool->int. * Allow clippy::clone_on_ref_ptr and as_conversions on the prost/tonic generated proto module. * Rephrase ProtoTruncateResponse.deleted doc comment to avoid generating an unclosed-HTML-tag rustdoc warning. * mzcompose.py workflow_default now loops over workflows with test_case. * Register the four new persist_committer LD flags in parallel_workload's FlipFlagsAction and in mzcompose UNINTERESTING_SYSTEM_PARAMETERS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For staging exercise only. Revert before merging the PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

6880 collides with the SQL password listener defined in src/materialized/ci/listener_configs/testdrive.json (and friends). Move the committer to 6882 to sit outside the 6875-6881 range already used by listeners and testdrive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Routing envd's own persist traffic through tonic on the same runtime as the committer's gRPC server deadlocks the runtime under load (observed in pgwire and sqllogictest suites). Mirror the pubsub new_same_process_connection pattern: the committer exposes an InProcessConsensus that dispatches directly into PersistCommitter::*_inner, and start_committer returns it on CommitterHandle alongside the existing loopback Channel. envd now configures the PersistClientCache with the in-process consensus via the new set_committer_in_process helper, which takes precedence over any committer Channel. Clusterds still use RpcConsensus over gRPC. Tests that exercise the wire path can still reach the loopback Channel. Also fixes the smoke.td DROP order (materialized view before its table). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Test harnesses (mz-environmentd integration tests and sqllogictest) construct ControllerConfig directly without starting a persist committer. Passing a non-empty URL like http://localhost:6882 made spawned clusterds configure RpcConsensus pointing at a non-existent listener, hanging forever once the LD flag enabled the committer path. Make ControllerConfig.persist_committer_url an Option<String>. The process orchestrator only emits --persist-committer-url when Some. environmentd's main wraps its always-present clap string in Some; test harnesses pass None. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Keep the startup banner at info level. CaS-per-op logs would flood at normal verbosity under any real workload; move them to debug so the default MZ_LOG_FILTER doesn't drown in them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update the in-process transport open question to reflect what we actually shipped (direct trait dispatch via InProcessConsensus) and why (loopback gRPC on the same runtime deadlocked tests). * Add "Out of scope": the timestamp oracle retains its own direct CockroachDB connection and is a separate wart. * Add "Future architecture: extract persistd as a separate service" motivating moving the committer (and eventually pubsub, compaction) out of environmentd's process. The InProcessConsensus adapter is flagged as a transient workaround that disappears in that future. * Add "Future direction: protocol consolidation with pubsub" capturing the eventual unified PersistRouter surface. * Add "Transport choice" calling out gRPC as a v1 expedient given both ends are always on the same revision; lists leaner alternatives to benchmark against. * Add "Future direction: single-writer batched CaS" capturing the pipelined-batch RPC, multi-writer fast-fail, and the long-term apply-API direction motivated by the single-writer-per-shard contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Operation errors are now data on the response rather than tonic Status. Each unary response (Head, Scan, CompareAndSet, Truncate, ListKeys) is a oneof { ok, err } where err carries a ProtoDeterminacy classifier. tonic Status is reserved for real transport failures, which the client treats as Indeterminate. Before this change, the server flattened both arms of ExternalError into Status::internal(...) and the client mapped any tonic Status back to Indeterminate. Determinate errors from the underlying Postgres consensus were therefore retried indefinitely as if indeterminate, blocking controller progress in nightly tests (wallclock-lag, introspection sources, unified compute introspection, subscribe timeouts). Unknown / unspecified determinacy on the client defaults to Indeterminate so forward-compat with older servers remains retry-safe.

…refresh_interval ALTER SYSTEM SET expects SQL string literals for duration-typed flags; the unquoted form 5s parses as identifier and the statement fails. Match the existing quoting convention used by other duration flags in this file.

The orchestrator unconditionally added --persist-committer-url and --internal-persist-committer-listen-addr to the envd argv and created a headless committer Service. When the deployed envd image predates the committer flags, clap rejects the unknown arguments and the container ends in CrashLoopBackOff. Change the orchestrator config's committer port to Option<u16> with no default. When set, orchestratord creates the Service and emits the two argv flags as before. When unset, neither resource is produced, so older envd images boot unchanged. Helm chart values can opt in once the deployed envd is known to understand the flags. Service teardown remains unconditional because delete_resource is idempotent on 404s, which lets a Some -> None reconfiguration clean up a previously-created Service.

The committer's eager open of the metadata backend connection panics envd if CRDB / postgres is not yet accepting on TCP at startup. Older envd images absorbed this race inside persist's own retry logic; the committer wiring puts the open on a critical-path call that gets no retries. Wrap the open in mz_ore::retry::Retry with a 60 s max duration and a 2 s backoff cap so envd boots through transient backend availability gaps without losing the eventual fatal-error behavior on a permanent misconfiguration.

…vice The clusterd inherits --persist-committer-url from the envd default, which uses localhost. Inside the materialized container under the process orchestrator the loopback name resolution differs from the explicit 127.0.0.1 the pubsub override uses, producing a mismatch between the address envd binds and the address clusterds dial. Set PERSIST_COMMITTER_URL alongside the existing pubsub overrides so the two stay consistent.

Wires the previously-declared rpc_total, rpc_duration_seconds, and inflight_rpcs metrics into the tonic handlers via an RAII RpcGuard. Adds backing_duration_seconds (time in the underlying Consensus call only), inflight_rpcs_by_op, and cas_refresh_lag_seconds. Diffing rpc_duration_seconds against backing_duration_seconds isolates committer-internal cost from backing-store cost; inflight_rpcs_by_op plus a sudden rise in backing_duration_seconds points at PG pool exhaustion; cas_refresh_lag_seconds reveals whether the post-mismatch background head() is losing races against the client's next CaS retry. Spawns a periodic stats heartbeat task that logs one INFO line per configured interval summarizing per-op counts, mean latencies, cache hit rate, CaS outcome breakdown, in-flight gauges, and refresh lag. Default 30s; controlled by the new persist_committer_stats_heartbeat_interval dyncfg (0s disables). Useful when Prometheus is not available, e.g. CI nightly logs.

Both directions hit the default cap immediately: a scan response of 4.7 MiB triggered "decoded message length too large: found 4737597 bytes, the limit is: 4194304 bytes" and persist's normal retry loop spun forever because every retry produced the same oversized payload. Persist consensus state grows past 4 MiB on hot shards by design, so the only correct fix is to disable the limits. Matches the precedent in mz_persist_client::rpc, which uses usize::MAX for pubsub for the same reason.

Adds three log lines for parity with the pubsub diagnostics that operators rely on when reading raw envd / clusterd logs in CI: Connecting to Persist Committer: <url> Connected to Persist Committer: <url> Received Persist Committer connection from: <addr> The first fires at clusterd startup before the lazy gRPC channel is constructed. The second latches once per RpcConsensus after the first successful round trip. The third fires once per remote peer on the server side, gated by a BTreeSet on the committer. set_committer_channel now carries the URL alongside the Channel so RpcConsensus can render it in the connected line. The URL is used purely for diagnostics; routing still happens entirely over the existing channel.

subscribe_inner read the snapshot first and registered the broadcast receiver second. Any CaS that committed during that window published to a registry with no listener, and the new subscriber's first event was a later diff while the just-committed value was silently dropped. Swap the order: register first, then read the snapshot. Diffs that commit during the snapshot read are now buffered in the broadcast channel and delivered to the receiver. Callers already need to dedup by seqno because the snapshot can include diffs that are then redelivered, so the new worst case is a duplicate event, never a missed one.

Adds a mz-persistd binary crate that runs the same persist committer logic environmentd embeds, but in its own process. Clusterds (and any co-deployed environmentd) point --persist-committer-url at it. Includes: * src/persistd: binary crate, mzbuild image, Dockerfile * misc/python/materialize/mzcompose/services/persistd.py: Persistd Service helper for compositions * test/persistd/mzcompose.py: smoke composition that brings up persistd alongside materialized, points MZ_PERSIST_COMMITTER_URL at the persistd container, and exercises a small testdrive flow plus restart survival * ci/test/pipeline.template.yml: nightly step for the new composition The standalone committer is the v1 step toward extracting persist out of envd entirely. environmentd still starts its own in-process committer; pointing MZ_PERSIST_COMMITTER_URL externally only affects clusterds in this configuration. A follow-up will let envd skip the in-process committer when configured to use an external persistd.

Adds an --external-persist-committer / MZ_EXTERNAL_PERSIST_COMMITTER flag. When set, environmentd does not start its own in-process committer; instead it dials --persist-committer-url as a gRPC client, the same path clusterds use. Pairs with the new persistd binary so a single standalone committer serves both envd and clusterd consensus traffic. In-process remains the default. The persist_consensus_use_committer LD flag still gates whether any particular consensus open actually routes through the committer, so this flag only changes where the committer lives, not whether it's used. test/persistd/mzcompose.py now sets MZ_EXTERNAL_PERSIST_COMMITTER=true so the smoke composition exercises the full external-persistd path.

mzbuild's Cargo.toml parser counts both the explicit [[bin]] entry and the autobins discovery of src/bin/persistd.rs, producing 'bin persistd appears more than once in cargo workspace'. Autobins alone suffices since the file name matches the desired binary name.

Adds a generic Service.companions list and extends the Composition SERVICES loop to register companions alongside their parent. Uses the mechanism in Materialized to auto-attach a paired Persistd container whenever an external Postgres-flavored metadata store is in use. Default behavior: every Materialized in every composition gains a sibling <materialized-name>-persistd container that owns the persist consensus traffic for that envd. envd dials it as a client over gRPC, the same path clusterds use, via MZ_EXTERNAL_PERSIST_COMMITTER=true. Survives envd restarts and container swaps, which fixes the loopback wedge legacy-upgrade hit (clusterds left dialing a dead envd's 127.0.0.1:6882 across a Materialized container swap). Opt out per-test with `Materialized(external_persist_committer=False)` when topology assertions or specialized wiring demand it. The dedicated test/persistd composition disables auto-attach to keep its explicit named persistd container as the only one in play.

The standalone persistd composition referenced postgres-metadata in Persistd.depends_on but did not declare the service, so docker compose rejected the project with 'undefined service postgres-metadata'. Use metadata_store_services() to add the right metadata store (postgres-metadata / cockroach / etc. per env) and route persistd's --consensus-url at the same hostname instead of the hardcoded default.

…ache bin/bump-version now updates persistd's Cargo.toml version alongside other binary crates. Source comment touched to force CI to re-resolve the persistd image hash; the previous run reportedly didn't produce the persistd binary despite mkpipeline listing it.

`tonic::transport::Endpoint::connect_lazy` builds the hyper client, which immediately calls `tokio::spawn` to drive its connection pool. Outside a runtime context, that panics with `there is no reactor running, must be called from the context of a Tokio 1.x runtime`. envd's `--external-persist-committer` path hit this because the wiring sits in `fn run(args) -> Result<(), anyhow::Error>` ahead of the first async hop, so the runtime was constructed but no task was on it yet. Wrap the `connect_lazy()` call in `runtime.block_on(async { ... })` so the implicit `tokio::spawn` lands on the right runtime.

`Composition.override()` rewrote the service entry for each target but never walked its `companions` list. When a `Materialized` auto-attached a paired `Persistd`, the override path landed an envd entry whose `depends_on` referenced an undefined `materialized-persistd` service, and docker compose rejected the project with: service "materialized" depends on undefined service "materialized-persistd": invalid compose project The static SERVICES path already recursed into companions; bring `override()` in line. Companions are exempt from the `fail_on_new_service` check because they piggyback on the parent service rather than appearing in the composition's SERVICES list. The wholesale `self.compose = old_compose` restore in the `finally` block already handles cleanup.

The Subscribe path was advertised as the cache-freshness mechanism for multi-writer scenarios, but no client ever opened a Subscribe stream: * `RpcConsensus` has no `subscribe` method. * `InProcessConsensus` has no `subscribe` method. * `refresh::spawn_refresh` only refreshed shards reported by `cache.shards_with_subscribers()`, which is always empty, so the TTL refresh never executed. The cache therefore self-corrected solely via the CaS-mismatch fire-and-forget head() refresh in `cas_inner`. Ship that, drop the inert machinery: * Delete `subscribe.rs` (SubscriberRegistry + broadcast). * Delete `refresh.rs` (periodic per-subscribed-shard refresh). * Drop the `Subscribe` RPC and `Proto{Subscribe,Subscribe*Message}` messages from the proto. Rebuilt protos no longer expose them. * Strip the `subscribe_inner` method, broadcast plumbing, and `SubscribeStream` type from `server.rs`. The remaining `spawn_refresh` is the CaS-mismatch refresh and stays. * Strip `subscribers` count, `SubscriberToken`, `shards_with_subscribers`, and the subscriber-pinned LRU logic from `ShardCache`. Eviction now picks the oldest entry; the LRU pin was only meaningful to the periodic refresh. * Drop `PERSIST_COMMITTER_CACHE_REFRESH_INTERVAL` dyncfg, the matching CLI flag on `persistd`, the `cache_refresh_interval` field on `CommitterConfig`, the parallel_workload flag entries, and the `UNINTERESTING_SYSTEM_PARAMETERS` reference. * Drop now-unused `async-stream` and `tokio-stream` dependencies from `mz-persist-committer`. No behavior change: those code paths were unreachable. The design doc's "subscriber-driven freshness" claim is now also fiction-free.

* Drop the Subscribe RPC, SubscriberRegistry, and periodic refresh task from the Components, Data-flow, Cache-freshness, and Error- handling sections to match the now-shipped surface. * Strike the `persist_committer_cache_refresh_interval` flag and add the actual `persist_committer_stats_heartbeat_interval` knob the binary respects. * Update Metrics to the set the code registers (rpc_total, rpc_duration_seconds, backing_duration_seconds, cache_hits, cache_misses, cached_shards, inflight_rpcs[_by_op], cas_refresh_lag_seconds); flag overload protection / queue depth as a follow-up rather than implemented. * Drop the "parameterize the entire persist correctness suite" testing claim; describe the unit, mzcompose, platform-check, and parallel-workload tests that actually exist. * Move "standalone persistd binary" out of Out-of-scope (it ships here as a binary + image + mzcompose companion auto-attach). * Document the omitted Subscribe RPC explicitly under Out-of-scope so the freshness story is honest about relying on CaS-mismatch refresh. * Rewrite "Future architecture: extract persistd" into "Standalone persistd" describing what landed (binary, env-flag, companion) and what is still TODO (orchestratord, helm, in-envd removal). * Drop the docs/superpowers/plans/ scratch plan from the PR. No code changes.

… on healthy Three diagnosable / preventable failures observed in feature-benchmark and parallel-benchmark: * The startup INFO line did not include `consensus_url`, so when persistd repeatedly logged `waiting for consensus backend: indeterminate: error connecting to server`, operators had no way to confirm whether the URL passed by the orchestrator matched the metadata store they expected. `SensitiveUrl` masks any password component, so logging the URL at INFO is safe. * The startup retry was bounded at 60s. When the metadata store exposes `service_started` (process alive) but not yet `service_healthy` (accepting connections) for longer than that, persistd exhausted its retries and exited rather than waiting, which races compose's own healthcheck. Drop the cap and let the loop run indefinitely; compose's healthcheck timeout governs. * The `Materialized`-auto-attached `Persistd` companion wired its metadata-store `depends_on` as `service_started`, while envd itself depends on the same store with `service_healthy`. That left persistd guaranteed to outrun the backend on cold start in any composition where the backend takes more than a few seconds to begin accepting connections. Mirror envd's `service_healthy` condition. Also log `consensus backend ready` once `open()` succeeds, so that a long startup followed by a quiet steady state is distinguishable from a hung retry loop.

…chema Two robustness fixes prompted by a deployment where the consensus backing store was reset under a running persistd: every consensus op then failed forever with an opaque "db error" and the deployment stalled silently. Diagnostics (shared by persistd and the in-envd committer): * Log a failed backing-consensus op at warn with its full cause chain, so the backend's real error (e.g. a Postgres SQLSTATE) is visible. The per-op debug traces are off by default, so failures were silent. * Send the full cause chain over gRPC instead of the top-level Display. For a Postgres-backed error the top level is the opaque "db error"; the SQLSTATE and message live in the source chain, which clients (and operators reading their logs) otherwise never saw. Recovery (standalone persistd only): * Add ExternalError::is_undefined_table, detecting SQLSTATE 42P01. * Wrap persistd's consensus in TerminateOnMissingSchema: if the consensus relation vanishes (the backing store was reset), persistd exits non-zero so its supervisor restarts it and PostgresConsensus::open recreates the schema, instead of spinning forever. The policy is intentionally local to the standalone binary; the in-process committer in environmentd shares the same committer code but must never self-terminate. Tests: a committer unit test asserts the gRPC error preserves the cause chain; the env-gated postgres consensus test asserts is_undefined_table fires after the consensus table is dropped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Workspace-wide cargo feature unification compiles mz-persist with the foundationdb feature (enabled by materialized), so the persistd binary carries a DT_NEEDED libfdb_c.so entry even though persistd does not enable the feature itself. The distroless base lacks the library, so the process failed at load with "libfdb_c.so: cannot open shared object file". Copy the FoundationDB client library into the image, mirroring ubuntu-base which does the same for materialized. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…exit Two follow-ups from running the committer against a metadata-store reset. * time_backing logged every failed backing op at warn, including CRDB serialization conflicts (determinate, retried by the client). That floods the log with hundreds of routine "could not serialize access" lines. Only warn on indeterminate failures; the 42P01 schema-loss error stays visible. * The standalone persistd exits non-zero when its consensus schema vanishes so it can be restarted and recreate the schema, but the mzcompose Persistd service had no restart policy. The container stayed down and its DNS name stopped resolving, leaving materialized stuck on "connection refused" / "dns error". Give persistd a bounded on-failure restart policy, matching cockroach and postgres. Restart is the orchestrator's job, not an entrypoint loop. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Each scenario kills and removes cockroach, materialized, clusterd, and testdrive and wipes the mzdata volume, but left the auto-attached materialized-persistd companion running. The orphaned persistd then faced the next scenario's fresh, empty cockroach: its consensus schema was gone, so it exited and the next materialized wedged unhealthy ("Found orphan containers", then repeated retries). Tear persistd down together with the metadata store it fronts. The companion is only a declared service inside the mz override, so the kill/rm run within `c.override(mz, clusterd)`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The distroless base lacked shared libraries the persistd binary needs. Feature unification compiles mz-persist with the same heavy dependency set as environmentd (FoundationDB, plus xz/lzma via arrow/parquet), so persistd needs libfdb_c.so, liblzma.so.5, and friends at load time. Bundling each by hand was whack-a-mole (libfdb_c.so, then liblzma.so.5, ...). Switch from distroless-prod-base to prod-base, which provides the full Ubuntu runtime via ubuntu-base, matching environmentd and clusterd. Drops the manual libfdb_c.so copy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The terminate-on-missing-schema path returned an error that bubbled up to a panic! in main. A panic prints a backtrace (slow, and RUST_BACKTRACE=1 is set in the image) and is counted as a failure by the test harness even though the exit is intentional. Log the cause and std::process::exit(1) directly from the schema-lost branch. The supervisor still restarts persistd, and open() recreates the schema, but without the panic and backtrace. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

When the backing store is reset out from under a running persistd (test harnesses wipe the metadata store between phases; a fresh store is provisioned), the consensus relation disappears and every op fails with SQLSTATE 42P01. The previous behavior exited the process for a supervised restart, but harnesses strand the persistd companion faster than restart + health-wait can recover, cascading into "materialized is unhealthy" failures (pg-cdc / mysql-cdc source-versioning migration, feature-benchmark). Recover in place instead: SelfHealingConsensus wraps the backend and, on an undefined-table error, re-opens consensus (re-running the idempotent CREATE SCHEMA/TABLE IF NOT EXISTS DDL) and retries. The reopened handle is discarded; the original pool finds the recreated table on retry. This is local to the standalone binary — the in-process committer in environmentd must never silently recreate a missing schema, which would mask data loss. This removes the need for the terminate-and-restart scaffolding, so revert the persistd restart policy and the feature-benchmark per-scenario teardown. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

In k8s the persist committer runs inside environmentd; clusterds in their own pods must reach it over the network. The committer Service and the --persist-committer-url / --internal-persist-committer-listen-addr args existed but were gated behind an opt-in port that defaulted to unset, so envd fell back to its loopback committer URL and threaded that to clusterds. They got localhost:6882 (their own empty pod), every persist read was refused, and 0dt rollouts never reached ReadyToPromote. Mirror the pubsub wiring: default the committer port (6882) and pass the routable per-generation headless Service DNS as the committer URL, so envd listens on 0.0.0.0 and clusterds dial the Service. Gate the args and Service on meets_minimum_version(V26_27_0_DEV0) — the version that introduced the committer flags — so an upgrade from an older envd image (which would exit on the unknown flags) stays safe. Replaces the opt-in Option<port> with a version check, matching how other envd args are gated across upgrades. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Rebase onto main moved the workspace to 26.28.0-dev.0. Bump persistd's hardcoded version to match, and move the committer version gate from 26.27.0-dev.0 to 26.28.0-dev.0 — the dev version the feature now lands in — so a 26.27 image (released without the committer flags) is correctly excluded during upgrades. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

antiguru force-pushed the persist-committer-design branch 3 times, most recently from 3e2222a to c61e13b Compare May 29, 2026 02:35

antiguru and others added 27 commits May 30, 2026 07:21

persist-committer: scaffold new crate

a7f4226

Skeleton crate with empty submodules. Functionality lands in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

persist-committer: define ProtoPersistConsensus service

d84a7bd

Adds the proto definition for Head/Scan/CompareAndSet/Truncate/Subscribe and wires the prost+tonic build script for the crate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

persist-committer: bin/fmt cleanups

ea13f1e

Reformatted the proto with buf and the Python check with black/ruff. Pulls in a Cargo.lock update that was stale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

TEMP DO NOT MERGE: default persist_consensus_use_committer to true

6b02944

For staging exercise only. Revert before merging the PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

antiguru and others added 26 commits May 30, 2026 07:21

ci: include persistd in DEBUGINFO_BINS for symbol upload

2d77e3f

antiguru force-pushed the persist-committer-design branch from 068471c to 227abac Compare May 30, 2026 11:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

persist-committer: in-envd consensus committer service#36762

persist-committer: in-envd consensus committer service#36762
antiguru wants to merge 53 commits into
MaterializeInc:mainfrom
antiguru:persist-committer-design

antiguru commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

antiguru commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant