CS-10953: cross-process populate coordination for CachingDefinitionLookup #4645
lukemelia wants to merge 2 commits into cs-10952-cross-process-invalidation-broadcast
Host Test Results: 1 file ±0, 1 suite ±0, 1h 44m 4s ⏱️ +36s. Results for commit 030d174; comparison against earlier commit bffbfea.
Realm Server Test Results: 1 file ±0, 1 suite ±0, 16m 20s ⏱️ -5s. Results for commit 030d174; comparison against earlier commit bffbfea.
Stacks on CS-10952. Adds a `pg_try_advisory_xact_lock` + NOTIFY-wait
coalescing layer between CachingDefinitionLookup's #inFlight coalescer
and the prerenderer, so at most one realm-server process per coalesce
key reaches the prerenderer; peers block on NOTIFY and re-read the
populated row.
* runtime-common/definition-lookup.ts:
- exports MODULE_CACHE_POPULATED_CHANNEL + the PopulateCoordinator
interface (`tryAcquireAndRun`, `waitForKey`)
- CachingDefinitionLookup constructor takes an optional
`populateCoordinator` (5th arg). When provided,
loadModuleCacheEntry routes through a new
loadModuleCacheEntryCoordinated that does an outer
`for COALESCE_MAX_ITERATIONS` loop: optimistic cache read → try
lock via coordinator → on win, run uncoordinated body inside the
lock (the body's existing cache double-check + prerender +
generation-check + persist) → on loss, wait for peer's NOTIFY
(180s timeout) and loop. Throws after COALESCE_MAX_ITERATIONS so a
pathological peer crash-loop or NOTIFY-drop sequence surfaces (see
the sketch after this list).
- When no coordinator is provided (default; sqlite/in-memory
deployments; the vast majority of test setups), the original
uncoordinated path runs unchanged.
* realm-server/lib/module-cache-coordination.ts (new):
ModuleCacheCoordinator implements PopulateCoordinator. Mirrors the
withRealmWriteLock pattern but with `pg_try_advisory_xact_lock` (non-
blocking) so losers don't pin pool clients for the duration of a
peer's prerender. `waitForKey` registers a callback on a per-key
Set, the LISTEN handler (PgAdapter.listen on
module_cache_populated) dispatches NOTIFYs into the matching set.
pg_notify is emitted INSIDE the same tx as the lock so peers only
see the signal on commit (the persist itself ran on the shared
dbAdapter and is already visible by then). Always notifies on
success, even when fn returned undefined (all populationCandidates
produced missing-module errors), so peers don't sit on the 180s
timeout for a "no row" outcome — small spec divergence documented
in the file.
* realm-server/main.ts: behind PRERENDER_COALESCE_ACROSS_PROCESSES=true
env flag (default off). When on, constructs + starts a
ModuleCacheCoordinator and passes it to CachingDefinitionLookup.
Added to the shutdown Promise.all alongside the other listeners.
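
A minimal sketch of the coordinated loop from the first bullet, assuming
hypothetical helpers `readCacheEntry` (the optimistic read) and
`runUncoordinatedBody` (the existing double-check + prerender +
generation-check + persist body); the real method and argument shapes in
definition-lookup.ts may differ:

```typescript
const COALESCE_MAX_ITERATIONS = 4;
const PEER_NOTIFY_TIMEOUT_MS = 180_000;

interface PopulateCoordinator {
  // Runs `fn` under a non-blocking advisory lock; reports whether we won it.
  tryAcquireAndRun(
    coalesceKey: string,
    fn: () => Promise<void>,
  ): Promise<{ acquired: boolean }>;
  // Parks until a peer's NOTIFY for this key arrives, or the timeout fires.
  waitForKey(coalesceKey: string, timeoutMs: number): Promise<void>;
}

async function loadModuleCacheEntryCoordinated<Entry>(
  coalesceKey: string,
  coordinator: PopulateCoordinator,
  readCacheEntry: () => Promise<Entry | undefined>,
  runUncoordinatedBody: () => Promise<void>,
): Promise<Entry | undefined> {
  for (let i = 0; i < COALESCE_MAX_ITERATIONS; i++) {
    // Optimistic read: a peer may already have populated the row.
    let cached = await readCacheEntry();
    if (cached !== undefined) {
      return cached;
    }
    // Try to win the advisory lock without blocking.
    let { acquired } = await coordinator.tryAcquireAndRun(
      coalesceKey,
      runUncoordinatedBody,
    );
    if (acquired) {
      // Winner: the body persisted (or produced "no row"); read our own work.
      return await readCacheEntry();
    }
    // Loser: park on the peer's NOTIFY, then loop and re-read.
    await coordinator.waitForKey(coalesceKey, PEER_NOTIFY_TIMEOUT_MS);
  }
  throw new Error(
    `module cache populate for ${coalesceKey} did not settle after ${COALESCE_MAX_ITERATIONS} attempts`,
  );
}
```

Note that in this sketch `waitForKey` resolves (rather than rejects) on
timeout, so a loser that misses a NOTIFY simply loops, re-reads, and
re-contends; only the iteration cap converts persistent failure into an error.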
Behavior at N=1: inert. The try-lock always succeeds uncontended; the
loser path is never taken; self-NOTIFY is dropped (no waiters
registered).
Behavior at N>1 (with the flag on): N× prerender-server load reduction
on cold fan-out — 1 prerender per unique module across the whole
fleet instead of N.
Tests forthcoming in a follow-up commit.
Linear: https://linear.app/cardstack/issue/CS-10953
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the lightweight pattern from module-cache-invalidation-listener-test:
real PgAdapter via setupDB, stub prerenderer/virtualNetwork, no realm-server
fixture. Two modules:
* `ModuleCacheCoordinator unit` — exercises the coordinator surface
directly:
- tryAcquireAndRun uncontended → acquired:true, fn runs, peer waiter
sees NOTIFY on commit
- tryAcquireAndRun contended → second caller gets acquired:false
immediately (loser does not pin the pool client)
- waitForKey resolves on NOTIFY before timeout
- waitForKey resolves on timeout when no NOTIFY arrives
- waitForKey ignores NOTIFYs for unrelated keys
- shutDown wakes parked waiters so callers don't hang during teardown
* `CachingDefinitionLookup coordinated path (integration)` — exercises
the wired-up lookup with two instances on one DB:
- concurrent same-module lookup across two instances → exactly one
prerender call total (B parks on NOTIFY, wakes after A persists,
re-reads cache, returns row)
- coordinator-less single instance still works (sqlite/in-memory
deployment guard)
- cache-hit short-circuits before contending the lock (fresh second
instance reading an already-cached row never calls its prerenderer)
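
A sketch of the shape one of the unit cases might take; `makeCoordinatorPair`
is a hypothetical helper standing in for the setupDB-based fixture, and the
surface below mirrors the `PopulateCoordinator` interface rather than the
real class:

```typescript
import { module, test } from 'qunit';

interface TestCoordinator {
  tryAcquireAndRun(
    key: string,
    fn: () => Promise<void>,
  ): Promise<{ acquired: boolean }>;
  waitForKey(key: string, timeoutMs: number): Promise<void>;
  shutDown(): Promise<void>;
}

// Hypothetical fixture: two ModuleCacheCoordinators sharing one database.
declare function makeCoordinatorPair(): Promise<
  [TestCoordinator, TestCoordinator]
>;

module('ModuleCacheCoordinator (sketch)', function () {
  test('waitForKey resolves on a peer NOTIFY before the timeout', async function (assert) {
    let [a, b] = await makeCoordinatorPair();
    try {
      // Park B first so A's NOTIFY has a registered waiter to wake.
      let parked = b.waitForKey('https://example.test/module.gts', 5_000);
      let { acquired } = await a.tryAcquireAndRun(
        'https://example.test/module.gts',
        async () => {
          // populate body; the NOTIFY is emitted on A's commit
        },
      );
      assert.true(acquired, 'uncontended try-lock wins');
      let before = Date.now();
      await parked;
      assert.ok(
        Date.now() - before < 5_000,
        'waiter woke on NOTIFY rather than the timeout',
      );
    } finally {
      await Promise.all([a.shutDown(), b.shutDown()]);
    }
  });
});
```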
Linear: https://linear.app/cardstack/issue/CS-10953
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stacked on #4644 (CS-10952). Second sub-issue under the CS-10950 umbrella. Closes the populate-coalescing half of the cross-process coordination story.
Note
This PR's base is `cs-10952-cross-process-invalidation-broadcast` (the CS-10952 PR's branch), not `main`. Once CS-10952 lands, GitHub will auto-rebase this PR's diff onto main; or rebase manually.

Why
CS-10948 added in-process generation discard. CS-10952 broadcasts invalidations across processes. But cross-process populate — coalescing the prerender work itself — still has a gap: each of N realm-server processes independently misses the `modules` cache on cold fan-out and fires its own prerender for the same URL. Prerender servers see N× the work they need. Measurable today during from-scratch indexing.

This PR adds a `pg_try_advisory_xact_lock` + NOTIFY-wait coalescing layer between `CachingDefinitionLookup`'s `#inFlight` coalescer and the prerenderer, so at most one realm-server process per coalesce key reaches the prerenderer. Peer processes block on NOTIFY and re-read the populated row.

What changes
runtime-common/definition-lookup.ts

- Exports `MODULE_CACHE_POPULATED_CHANNEL` and a new `PopulateCoordinator` interface (two methods: `tryAcquireAndRun(coalesceKey, fn)` and `waitForKey(coalesceKey, timeoutMs)`).
- The `CachingDefinitionLookup` constructor takes an optional `populateCoordinator` (5th arg). When provided, `loadModuleCacheEntry` routes through a new `loadModuleCacheEntryCoordinated` that adds an outer `for COALESCE_MAX_ITERATIONS` loop: optimistic cache read → try lock via coordinator → on win, run the uncoordinated body inside the lock → on loss, wait for the peer's NOTIFY and loop. `COALESCE_MAX_ITERATIONS = 4`, so a pathological peer crash-loop or NOTIFY-drop sequence surfaces as an error instead of silently hanging.

realm-server/lib/module-cache-coordination.ts (new)

- `ModuleCacheCoordinator` implements `PopulateCoordinator`. Mirrors the `withRealmWriteLock` pattern but with `pg_try_advisory_xact_lock` (non-blocking) so losers don't pin pool clients for the full prerender wall time (could be 150s in production).
- `waitForKey` registers a callback on a per-key `Set`; the LISTEN handler (`PgAdapter.listen` on `module_cache_populated`) dispatches NOTIFYs into the matching set.
- `pg_notify` is emitted inside the same transaction as the lock, so peers only see the signal on commit. The persist itself ran on the shared `dbAdapter` and is already visible by the time peers re-read on wake.
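A minimal sketch of the coordinator's two halves, assuming a raw `pg` Pool and Client rather than the repo's PgAdapter; `hashKeyForAdvisoryLock` is a hypothetical stand-in for a `hashRealmUrlForAdvisoryLock`-style helper, and the class is named to make clear it is not the real implementation:

```typescript
import { createHash } from 'node:crypto';
import { Client, Pool } from 'pg';

const CHANNEL = 'module_cache_populated';

// Hypothetical stand-in: fold the coalesce key into the signed 64-bit
// space Postgres advisory locks expect.
function hashKeyForAdvisoryLock(key: string): bigint {
  return createHash('sha256').update(key).digest().readBigInt64BE(0);
}

class ModuleCacheCoordinatorSketch {
  private waiters = new Map<string, Set<() => void>>();
  private listener: Client | undefined;

  constructor(private pool: Pool) {}

  async start() {
    // Dedicated connection for LISTEN; a pooled client can't hold one.
    this.listener = new Client(); // connection config omitted
    await this.listener.connect();
    this.listener.on('notification', (msg) => {
      if (msg.channel === CHANNEL && msg.payload) {
        // Dispatch the NOTIFY into the matching per-key waiter set.
        for (let wake of this.waiters.get(msg.payload) ?? []) wake();
        this.waiters.delete(msg.payload);
      }
    });
    await this.listener.query(`LISTEN ${CHANNEL}`);
  }

  async tryAcquireAndRun(
    coalesceKey: string,
    fn: () => Promise<void>,
  ): Promise<{ acquired: boolean }> {
    let client = await this.pool.connect();
    try {
      await client.query('BEGIN');
      // Non-blocking: losers return immediately instead of pinning this
      // pool client for the duration of a peer's prerender.
      let { rows } = await client.query(
        'SELECT pg_try_advisory_xact_lock($1) AS acquired',
        [hashKeyForAdvisoryLock(coalesceKey).toString()],
      );
      if (!rows[0].acquired) {
        await client.query('ROLLBACK');
        return { acquired: false };
      }
      await fn(); // populate body; persists on the shared dbAdapter
      // Same transaction as the lock: peers see the NOTIFY only on commit,
      // and we notify on every winner outcome, even "no row".
      await client.query('SELECT pg_notify($1, $2)', [CHANNEL, coalesceKey]);
      await client.query('COMMIT'); // the xact lock releases here
      return { acquired: true };
    } catch (e) {
      await client.query('ROLLBACK');
      throw e;
    } finally {
      client.release();
    }
  }

  waitForKey(coalesceKey: string, timeoutMs: number): Promise<void> {
    return new Promise<void>((resolve) => {
      let set = this.waiters.get(coalesceKey) ?? new Set<() => void>();
      this.waiters.set(coalesceKey, set);
      let wake = () => {
        clearTimeout(timer);
        set.delete(wake);
        resolve();
      };
      let timer = setTimeout(wake, timeoutMs); // timeout: caller loops
      set.add(wake);
    });
  }

  async shutDown() {
    // Wake parked waiters so callers don't hang during teardown.
    for (let set of this.waiters.values()) for (let wake of set) wake();
    this.waiters.clear();
    await this.listener?.end();
  }
}
```

If `fn` throws, the ROLLBACK releases the advisory lock without emitting a NOTIFY, which is exactly the peer-crash case the loser-side timeout and iteration cap exist to cover.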
realm-server/main.ts

Behind a `PRERENDER_COALESCE_ACROSS_PROCESSES=true` env flag (default off). When enabled, constructs + starts a `ModuleCacheCoordinator` and passes it to `CachingDefinitionLookup`. Added to the shutdown `Promise.all` alongside the other listeners.
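The wiring might look roughly like this self-contained sketch, where everything except the env flag and the optional coordinator argument is a stand-in for the real main.ts types:

```typescript
interface Coordinator {
  start(): Promise<void>;
  shutDown(): Promise<void>;
}

// `makeCoordinator` and `makeLookup` stand in for constructing
// ModuleCacheCoordinator and CachingDefinitionLookup with their real deps.
async function wireDefinitionLookup<Lookup>(
  makeCoordinator: () => Coordinator,
  makeLookup: (populateCoordinator?: Coordinator) => Lookup,
) {
  let coordinator: Coordinator | undefined;
  if (process.env.PRERENDER_COALESCE_ACROSS_PROCESSES === 'true') {
    coordinator = makeCoordinator();
    await coordinator.start(); // begins LISTEN on module_cache_populated
  }
  // undefined coordinator ⇒ the original uncoordinated path runs unchanged
  let lookup = makeLookup(coordinator);
  let shutDown = async () => {
    // joins the shutdown Promise.all alongside the other listeners
    await Promise.all([coordinator?.shutDown()]);
  };
  return { lookup, shutDown };
}
```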
Two small spec divergences

1. NOTIFY on every winner outcome (including missing-module). The CS-10953 spec says winners don't notify when all populationCandidates produced missing-module errors. We notify regardless. Trade-off: an extra harmless wake for peers vs. a 180s timeout cycle for parallel callers of a nonexistent URL. Cheap to choose the wake.
2. `ModuleCacheCoordinator` lives in `realm-server/lib/`, not `runtime-common/`. Same reason as CS-10952: `runtime-common` doesn't depend on `@cardstack/postgres`, and adding the dep would be circular (`@cardstack/postgres` already depends on `runtime-common` via `DBAdapter`). The `PopulateCoordinator` interface is in `runtime-common`; the implementation is in `realm-server`.

Behavior
N=1 (today's production): effectively inert. Try-lock always succeeds uncontended; loser path is never taken; self-NOTIFY is dropped (no waiters registered). Overhead is one extra `BEGIN; SELECT pg_try_advisory_xact_lock; pg_notify; COMMIT` per cache miss — measurable but sub-millisecond. The `PRERENDER_COALESCE_ACROSS_PROCESSES` flag is off by default so this overhead doesn't ship to production until we explicitly flip it.

N>1 (with the flag on): N× prerender-server load reduction on cold fan-out. 1 prerender per unique module across the whole fleet instead of N.
Test plan

New tests in `realm-server/tests/module-cache-coordination-test.ts`:

Coordinator unit tests (operate on `ModuleCacheCoordinator` directly):

- `tryAcquireAndRun` uncontended → `acquired: true`, fn runs, peer waiter sees the NOTIFY on commit.
- `tryAcquireAndRun` contended → second caller gets `acquired: false` immediately (does not pin the pool).
- `waitForKey` resolves on NOTIFY before timeout.
- `waitForKey` resolves on timeout when no NOTIFY arrives.
- `waitForKey` ignores NOTIFYs for unrelated keys.
- `shutDown` wakes parked waiters so callers don't hang during teardown.
Integration tests (full `CachingDefinitionLookup` with coordinator on real PgAdapter):

- Concurrent same-module lookup across two instances → exactly one prerender call total (B parks on NOTIFY, wakes after A persists, re-reads cache, returns row).
- Coordinator-less single instance still works (sqlite/in-memory deployment guard).
- Cache-hit short-circuits before contending the lock (a fresh second instance reading an already-cached row never calls its prerenderer).

What's NOT in scope (per ticket)
- `#moduleCache`.
- `AbortController` cancellation.

Related
- `withRealmWriteLock` / `hashRealmUrlForAdvisoryLock`: the primitive that the non-blocking try-lock here extends.
- `realm_file_changes`: the NOTIFY pattern the populate listener follows.

🤖 Generated with Claude Code