2.3.0 rc #5114

Draft
NathanFlurry wants to merge 57 commits into main from 2.3.0-rc



@NathanFlurry
Member

  • fix(rivetkit): exit pid1 after signal shutdown
  • fix(rivetkit): use engine actor stop threshold for shutdown
  • test(depot-client): stale vfs cache reads fail closed
  • test(depot-client): head fence read poisons vfs
  • test(depot-client): vfs stale page cache writer
  • test(depot-client): delayed read ahead stale pages
  • test(depot-client): startup preload stale pages
  • test(rivetkit-core): sqlite lifecycle fuzz harness
  • chore(kitchen-sink): agent load test
  • test(depot-client): batch atomic cap repro
  • test(depot-client): warm pidx stale read rmw repro
  • test(depot-client): natural warm pidx repro
  • test(depot-client): natural reopen warm pidx repro
  • [SLOP(claude-opus-4-7)] feat(envoy-client): add observability metrics for ws transport and sqlite request lifecycle
  • [SLOP(claude-opus-4-7)] fix(envoy-client): emit Stopped(Error) on lost-timeout to prevent silent destroy
  • Fix actor lost on envoy-client
  • DO NOT MERGE: serverless restart race condition
  • fix(rivetkit): use engine actor stop threshold for shutdown
  • test(kitchen-sink): sigterm sleep probe fixtures
  • feat(kitchen-sink): rust counter-latency harness
  • chore(kitchen-sink): refresh bench + smoke scripts
  • chore(kitchen-sink): counter actor + sigterm probe tweaks
  • chore(envoy-client): trace websocket backpressure
  • feat(envoy-client): add EnvoyStatusHandle wrapper
  • feat(rivetkit-core): wire EnvoyStatusHandle into dispatcher
  • feat(rivetkit-core): expose envoy status through /metrics
  • feat(rivetkit-napi): expose actorStopThresholdMs + envoy-aware health/metrics
  • feat(rivetkit-core): record connection close reason + lifetime metrics
  • feat(kitchen-sink): ws-ping fast-path on tunnel-stress + load-test-agent
  • Add debugging
  • fix(pegboard): add actor-scoped generation key for sqlite fencing
  • Revert "fix(pegboard): add actor-scoped generation key for sqlite fencing"
  • Cargo fmt
  • Fix actor generation validation for sqlite
  • [SLOP(claude-sonnet-4-5)] feat(metrics): add envoy lifecycle, stop reason, ws traffic, and js runtime metrics
  • [SLOP(claude-sonnet-4-5)] chore(logs): promote actor stop logs to info
  • [SLOP(claude-sonnet-4-5)] chore(logs): improve actor stop and envoy ping diagnostics
  • Remove slop
  • chore(kitchen-sink): add rivet cloud deploy workflow
  • [SLOP(gpt-5)] fix(rivetkit): reject comma-joined serverless endpoint header
  • [SLOP(gpt-5)] fix(rivetkit): disable cached serverless envoy by default
  • [SLOP(gpt-5)] fix(rivetkit): warn on cached serverless envoy regional mismatch
  • [SLOP(gpt-5)] docs(rivetkit): record performance audit notes
  • [SLOP(gpt-5)] test(envoy-client): update SharedContext fixtures for websocket diagnostics
  • [RIVETER(rivetkit-perf-fixes-4lv24k3r,#1,gpt-5.5)] chore: perf(envoy-client): convert StdMutex<HashMap> SharedContext fields to scc
  • chore(kitchen-sink): update deployment diagnostics wiring
  • [RIVETER(rivetkit-perf-fixes-4lv24k3r,#2,gpt-5.5)] chore: perf(envoy-client): replace ws_tx tokio Mutex with ArcSwapOption on hot path
  • [RIVETER(rivetkit-perf-fixes-4lv24k3r,#3,gpt-5.5)] chore: perf(envoy-client): replace BufferMap String keys with u64/[u8;8]
  • [RIVETER(rivetkit-perf-fixes-4lv24k3r,#4,gpt-5.5)] chore: perf(rivetkit-core): sample record_inbox_depths instead of every loop iteration
  • [RIVETER(rivetkit-perf-fixes-4lv24k3r,#5,gpt-5.5)] chore: fix(rivetkit): repair setInterval missing-delay bug in actor-conn keepalive
  • perf(rivetkit-core): tighten queue_metadata lock around enqueue
  • perf(rivetkit-core, envoy-client): convert scc sync methods to async in async contexts
  • perf(envoy-client, guard): enable TCP_NODELAY by default + expose ws_tx_depth metric
  • Add gradual shutdown for load test
  • Fix actor stopped restart

NathanFlurry and others added 30 commits May 24, 2026 11:00
… for ws transport and sqlite request lifecycle
chore(envoy-client): init new backpressure tracking fields
NathanFlurry and others added 21 commits May 24, 2026 11:27
…client): convert StdMutex<HashMap> SharedContext fields to scc
…client): replace ws_tx tokio Mutex with ArcSwapOption on hot path
…client): replace BufferMap String keys with u64/[u8;8]
…it-core): sample record_inbox_depths instead of every loop iteration
…t): repair setInterval missing-delay bug in actor-conn keepalive
@railway-app

railway-app Bot commented May 28, 2026

🚅 Deployed to the rivet-pr-5114 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
|---|---|---|---|
| website | ✅ Success (View Logs) | Web | May 30, 2026 at 3:02 am |
| frontend-cloud | 😴 Sleeping (View Logs) | Web | May 30, 2026 at 2:55 am |
| frontend-inspector | ❌ Build Failed (View Logs) | Web | May 28, 2026 at 9:44 pm |
| kitchen-sink | ❌ Build Failed (View Logs) | Web | May 28, 2026 at 9:42 pm |
| mcp-hub | ✅ Success (View Logs) | Web | May 28, 2026 at 9:35 pm |
| ladle | ✅ Success (View Logs) | Web | May 28, 2026 at 3:29 am |

@claude
Contributor

claude Bot commented May 28, 2026

Code Review: PR #5114 - 2.3.0 rc

Large release candidate (128 files, +32k/-945) spanning observability, performance, correctness, and testing.


Overview

  • Observability: New metrics for actor stop causality, WS traffic shape, envoy lifecycle, VFS close summaries
  • Performance: StdMutex<HashMap> to scc::HashMap in SharedContext, ws_tx tokio Mutex to ArcSwapOption, BufferMap key String to [u8; 8], inbox-depth sampling
  • Correctness: Lost-timeout emits Stopped(Error), PID1 signal fix, serverless envoy cache disabled by default, comma-joined endpoint rejection
  • Testing: VFS stale-page, batch-atomic, SQLite corruption fuzz harness, SIGTERM probe

Issues

1. "DO NOT MERGE" commit in PR body

The PR description includes "DO NOT MERGE: serverless restart race condition" as one of the commits. Please confirm this was addressed or reverted before merging.

2. actor.rs - Lost-timeout sends SleepIntent; metrics will misattribute

The behavioral fix is correct: sending SleepIntent prevents Pegboard from destroying the actor on a transient WS flap. However, stop_actor_reason_label(SleepIntent) emits "sleep_intent", so all lost-timeout stops appear in the sleep_intent metrics bucket with StopCode::Error. This is confusing in dashboards. Consider a dedicated "lost_timeout" label, or at minimum a comment documenting the intentional mismatch.
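
One way to keep the wire-level behavior while separating the dashboard bucket is to derive the metric label from both the reason and the trigger. The following is a minimal sketch under stated assumptions: `StopTrigger`, `stop_metric_label`, and the `StopIntent` variant are hypothetical names for illustration; only `SleepIntent` and the "sleep_intent" label come from the review.

```rust
/// Stop reasons sent to the engine. Only `SleepIntent` is named in the review;
/// `StopIntent` is a hypothetical placeholder for the remaining variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopActorReason {
    SleepIntent,
    StopIntent,
}

/// Why the stop was initiated locally. Hypothetical type, used only for metrics.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopTrigger {
    Requested,
    LostTimeout,
}

/// Metric label for a stop. The reason sent on the wire is unchanged;
/// only the metrics bucket differs for lost-timeout stops.
pub fn stop_metric_label(reason: StopActorReason, trigger: StopTrigger) -> &'static str {
    match (trigger, reason) {
        (StopTrigger::LostTimeout, _) => "lost_timeout",
        (StopTrigger::Requested, StopActorReason::SleepIntent) => "sleep_intent",
        (StopTrigger::Requested, StopActorReason::StopIntent) => "stop_intent",
    }
}
```

With this shape, a lost-timeout stop still sends SleepIntent to Pegboard but is counted under "lost_timeout" rather than "sleep_intent".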

3. conn.rs - actor_started_at and actor_stop_meta grow unbounded

record_actor_start and record_actor_stop_dispatch insert into these maps but there is no corresponding cleanup after a close is observed. For a long-lived Conn with many actors rotating through, entries accumulate indefinitely. Add cleanup at the actor close/remove path after stop metrics are emitted.
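
The cleanup could look roughly like the sketch below, which removes both entries at the point where the close is observed and the stop metrics are computed. `ConnMetrics`, `observe_actor_close`, the `String` key type, and the standard `HashMap` are assumptions made for a self-contained example; the real `Conn` fields and map types are not shown in this PR text.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the per-actor stop metadata recorded at dispatch.
pub struct ActorStopMeta {
    pub dispatched_at: Instant,
}

/// Simplified model of the two per-actor maps named in the review.
#[derive(Default)]
pub struct ConnMetrics {
    actor_started_at: HashMap<String, Instant>,
    actor_stop_meta: HashMap<String, ActorStopMeta>,
}

impl ConnMetrics {
    pub fn record_actor_start(&mut self, actor_id: &str, now: Instant) {
        self.actor_started_at.insert(actor_id.to_owned(), now);
    }

    pub fn record_actor_stop_dispatch(&mut self, actor_id: &str, now: Instant) {
        self.actor_stop_meta
            .insert(actor_id.to_owned(), ActorStopMeta { dispatched_at: now });
    }

    /// Called once the close is observed. Removing the entries (rather than
    /// only reading them) keeps both maps bounded by the number of live actors.
    /// Returns (actor lifetime, stop latency) for the caller to emit as metrics.
    pub fn observe_actor_close(
        &mut self,
        actor_id: &str,
        now: Instant,
    ) -> (Option<Duration>, Option<Duration>) {
        let lifetime = self
            .actor_started_at
            .remove(actor_id)
            .map(|started| now.duration_since(started));
        let stop_latency = self
            .actor_stop_meta
            .remove(actor_id)
            .map(|meta| now.duration_since(meta.dispatched_at));
        (lifetime, stop_latency)
    }

    /// Total number of entries across both maps, useful for a leak assertion.
    pub fn tracked_entries(&self) -> usize {
        self.actor_started_at.len() + self.actor_stop_meta.len()
    }
}
```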

4. vfs.rs - tracing::debug promoted to tracing::info on hot VFS paths

The "vfs get_pages fetch" and commit log events now fire at info on every page-fetch batch and every database commit. This will generate very high log volume in production. Keep these at debug, consistent with the existing tracing::enabled!(tracing::Level::DEBUG) guards in the same file.

5. Duplicate stop_reason_label in actor_lifecycle.rs and actor.rs

Both files define an identical StopActorReason -> &'static str mapping. This should live once in a shared location to prevent diverging over time.


Non-blocking Observations

WsTxMessage::Send struct expansion (context.rs): The Send variant grew from { data: Vec<u8> } to 8 fields including timestamps, IDs, and flags. The observability value is real, but this is on the WS hot path. Worth profiling under load.

is_ping_healthy refactor (handle.rs): Clean refactoring to last_ping_at_ms().is_some_and(...). Behavior is unchanged since last_ping_ts is initialized to now_millis() (non-zero). The None-means-never-pinged guard is a good clarification.
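
For readers unfamiliar with the pattern, here is a minimal sketch of a health check built on `Option::is_some_and`. It assumes that a stored timestamp of 0 encodes "never pinged" and that the timeout is passed in as `timeout_ms`; `PingState` and both parameters are hypothetical, and only `last_ping_ts`, `last_ping_at_ms`, and `is_ping_healthy` are names from the review.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Minimal stand-in for the handle; the real struct has more fields.
pub struct PingState {
    /// Milliseconds since the Unix epoch of the last ping. 0 means "never" (assumption).
    last_ping_ts: AtomicI64,
}

impl PingState {
    pub fn new(now_ms: i64) -> Self {
        Self { last_ping_ts: AtomicI64::new(now_ms) }
    }

    pub fn record_ping(&self, now_ms: i64) {
        self.last_ping_ts.store(now_ms, Ordering::Release);
    }

    /// `None` when no ping timestamp has been recorded.
    pub fn last_ping_at_ms(&self) -> Option<i64> {
        match self.last_ping_ts.load(Ordering::Acquire) {
            0 => None,
            ts => Some(ts),
        }
    }

    /// Healthy only if a ping exists and is newer than `timeout_ms`.
    pub fn is_ping_healthy(&self, now_ms: i64, timeout_ms: i64) -> bool {
        self.last_ping_at_ms()
            .is_some_and(|ts| now_ms.saturating_sub(ts) < timeout_ms)
    }
}
```

Because the state is initialized with the current time, the `None` branch is never taken on the normal path, which matches the review's note that behavior is unchanged.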

PID1 finishShutdownSignal fix (registry/index.ts): process.exit(130) / process.exit(143) for PID1 instead of re-raising. Correct fix for containers where default signal handlers are absent.
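
The two exit codes follow the shell convention of 128 plus the signal number. The actual fix lives in TypeScript; the Rust sketch below only illustrates the arithmetic, and `exit_code_for_signal` is a hypothetical helper name.

```rust
/// POSIX signal numbers for the two signals handled by the fix.
pub const SIGINT: i32 = 2;
pub const SIGTERM: i32 = 15;

/// Shell convention: a process terminated by signal N reports status 128 + N.
pub fn exit_code_for_signal(signal: i32) -> i32 {
    128 + signal
}
```

Exiting explicitly matters for PID 1 because the kernel does not apply default signal actions to the init process of a PID namespace, so re-raising the signal would not terminate it.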

actorStopThresholdMs for shutdown grace (registry/index.ts): Grace period shifts from hardcoded 30s to the engine-provided actor_stop_threshold_ms (defaulting to 30 minutes). Aligns TS-side drain with Pegboard's hard cutoff. Fallback and error handling are appropriate.

serverless_cache_envoy: false (registry/envoy_callbacks.rs): Correct default. The current protocol cannot authenticate per-request envoy reuse. The opt-in regional-mismatch warning is a good addition.

BufferMap key String to [u8; 8] (utils.rs): Cleaner and faster than the cyrb53 hash string. tunnel_request_key makes composite key construction explicit. The display_id helper avoids heap allocation on the hot log path.
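
A rough sketch of what a fixed-size composite key and an allocation-free display helper can look like is shown below. The key layout is an assumption: it treats the composite as two 4-byte ids, and the parameter names `gateway_id` and `request_id` as well as the `DisplayId` wrapper type are hypothetical. Only `tunnel_request_key` and `display_id` are names mentioned in the review, and their real signatures may differ.

```rust
use std::fmt;

/// Fixed-size map key. Assumes the composite is two 4-byte ids.
pub type TunnelKey = [u8; 8];

/// Builds the composite key by concatenating the two ids.
pub fn tunnel_request_key(gateway_id: [u8; 4], request_id: [u8; 4]) -> TunnelKey {
    let mut key = [0u8; 8];
    key[..4].copy_from_slice(&gateway_id);
    key[4..].copy_from_slice(&request_id);
    key
}

/// Formats a key as lowercase hex by writing straight into the formatter,
/// so logging it does not build an intermediate `String`.
pub struct DisplayId<'a>(pub &'a TunnelKey);

impl fmt::Display for DisplayId<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for byte in self.0 {
            write!(f, "{byte:02x}")?;
        }
        Ok(())
    }
}
```

A `[u8; 8]` key is `Copy`, hashes without pointer chasing, and compares as a single fixed-width value, which is where the gain over a `String` key comes from.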

scc::HashMap migration in SharedContext (context.rs, envoy.rs): Correct per project conventions. The remove_if_async using Arc::ptr_eq to avoid evicting a concurrently-inserted replacement entry is a good correctness detail.
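
The pointer-equality guard can be illustrated with the standard library alone. This sketch substitutes `Mutex<HashMap<..>>` for `scc::HashMap` and a synchronous function for `remove_if_async`, so it shows the idea rather than the PR's code; `remove_if_same` and the `u64` key are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Removes `key` only if the map still holds the exact `Arc` the caller saw.
/// Returns true when an entry was removed.
pub fn remove_if_same<V>(
    map: &Mutex<HashMap<u64, Arc<V>>>,
    key: u64,
    expected: &Arc<V>,
) -> bool {
    let mut guard = map.lock().unwrap();
    // Compare allocations, not values: a replacement with equal contents
    // is still a different entry and must survive.
    let is_same = guard
        .get(&key)
        .is_some_and(|current| Arc::ptr_eq(current, expected));
    if is_same {
        guard.remove(&key);
    }
    is_same
}
```

If another task has already inserted a replacement under the same key, a stale cleanup holding the old `Arc` becomes a no-op instead of evicting the new entry.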

Preload limit increases (optimization_flags.rs): MAX_STARTUP_PRELOAD_MAX_BYTES 8 MB to 64 MB and MAX_STARTUP_PRELOAD_FIRST_PAGE_COUNT 256 to 16,384 are caps on user-configured values, not defaults. A 64 MB ceiling could still cause startup memory pressure; worth documenting the expected envelope.
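
To make the distinction between a cap and a default concrete, here is a small sketch. The two constants use the values quoted above; `effective_preload_bytes` is a hypothetical helper, and the types are assumed to be `usize`.

```rust
/// Ceilings quoted in the review. They bound user configuration, not defaults.
pub const MAX_STARTUP_PRELOAD_MAX_BYTES: usize = 64 * 1024 * 1024;
pub const MAX_STARTUP_PRELOAD_FIRST_PAGE_COUNT: usize = 16_384;

/// Hypothetical helper: a configured value is used as-is unless it exceeds the ceiling.
pub fn effective_preload_bytes(configured: usize) -> usize {
    configured.min(MAX_STARTUP_PRELOAD_MAX_BYTES)
}
```

Assuming 4 KiB pages, 16,384 pages is exactly 64 MiB, so the two ceilings describe the same worst-case memory envelope.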


Minor / Style

  • envoy.rs (envoy_loop): #[allow(unused_assignments)] on branch - most select! arms write the variable but observe_envoy_loop_iteration is only called on break paths, making most assignments genuinely unused. Either observe at the end of every iteration or drop the variable from non-break paths.
  • serverless.rs: The trust-boundary block comment uses embedded section headers inside source. Per CLAUDE.md, inline comments should be normal sentences; this documentation belongs in docs-internal/ or a reference doc.
  • actor_lifecycle.rs: ActorStopMeta.dispatched_at uses std::time::Instant. Confirm this is intentional (pegboard-envoy context) vs crate::time::Instant from rivetkit-core.
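
For the envoy_loop item above, one option is to bind the branch label per iteration and observe it on every pass, which removes the need for `#[allow(unused_assignments)]`. The sketch below is a simplified synchronous model: `LoopEvent`, `run_loop`, and the label strings are hypothetical, and pushing to a `Vec` stands in for `observe_envoy_loop_iteration`.

```rust
/// Hypothetical events standing in for the `select!` arms of `envoy_loop`.
pub enum LoopEvent {
    Message,
    Tick,
    Shutdown,
}

/// Runs the loop and returns every branch label that was observed.
pub fn run_loop(events: impl IntoIterator<Item = LoopEvent>) -> Vec<&'static str> {
    let mut observed = Vec::new();
    for event in events {
        // Bind the label fresh each iteration so no assignment goes unread.
        let (branch, should_break) = match event {
            LoopEvent::Message => ("message", false),
            LoopEvent::Tick => ("tick", false),
            LoopEvent::Shutdown => ("shutdown", true),
        };
        // Stand-in for observe_envoy_loop_iteration(branch).
        observed.push(branch);
        if should_break {
            break;
        }
    }
    observed
}
```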

Summary: The correctness and performance work is solid. Three items need attention before merge: (1) audit the "DO NOT MERGE" commit, (2) fix the misleading metric label that puts lost-timeout stops in the sleep_intent bucket, and (3) bound the growth of actor_started_at / actor_stop_meta in Conn.

3 participants