-
Notifications
You must be signed in to change notification settings - Fork 55
Reject stale cached snapshot after write #902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| ## Problem | ||
|
|
||
| `getDocument()` uses a look-aside cache that is invalidated by **purging**. The purge is **not atomic with the committing transaction**, so under concurrency a reader that missed the cache can read the pre-commit row and re-`save` it into the cache **after** the writer's purge — leaving a stale entry that subsequent reads serve until TTL. | ||
|
|
||
| The existing `forUpdate` bypass (added in `ForUpdateCacheTest`) only protects the **locking read inside `updateDocument`**. A normal `getDocument()` still serves the stale snapshot — `testForUpdateReadBypassesStaleCache` even asserts this. That remaining read-path window is the root cause of **flaky project-config E2E tests**: `ProjectsConsoleClientTest::testGetProject` issues ~12 rapid read-modify-write PATCHes against a single project document and intermittently reads back old `auths` / `services` values. | ||
|
|
||
| ## Evidence | ||
|
|
||
| Runtime trace under Swoole (one coroutine per request, `upd` = `$updatedAt`): | ||
|
|
||
| ``` | ||
| [20] SAVE upd=51.030 coro 20 caches the row it read at request-init | ||
| [20] WRITE upd=51.085 coro 20 commits a NEWER version | ||
| [20] PURGE in-transaction purge | ||
| [20] PURGE post-commit purge | ||
| [21] HIT upd=51.030 coro 21 served STALE 51.030 from cache | ||
| ``` | ||
|
|
||
| In a real cloud E2E environment, `testGetProject` failed ~19 of 20 runs at varying assertions before the fix, and **20/20 after**. The failures were field-agnostic (`authPasswordDictionary`, `authPersonalDataCheck`, `authSessionsLimit` reverting to its default `0`, service flags, list sizes) — consistent with whichever write lost the cache race on a given run. | ||
|
|
||
| ## Fix | ||
|
|
||
| - On every committed `updateDocument` / `deleteDocument`, record the committed `$updatedAt` in a **sibling cache key** (`:__ver`) that `purgeCachedDocument()` does **not** delete, so it outlives the purge of the document body. | ||
| - On read, discard any cached snapshot whose `$updatedAt` predates that marker and reload from the adapter. | ||
| - The guard is **skipped** when no marker exists or the timestamp can't be parsed, so behaviour is unchanged outside the race. `cacheVersionStamp()` normalises `string` / `DateTimeInterface` / Mongo `UTCDateTime` values to a comparable microsecond timestamp. | ||
|
|
||
| ## Tests | ||
|
|
||
| - New deterministic regression test `ForUpdateCacheTest::testReadRejectsStaleCacheSnapshotReCachedAfterUpdate` — **fails without the fix** (`'stale'`), passes with it. It reproduces the race at the cache layer (re-`save` of a pre-commit snapshot after a committed update) without needing real concurrency. | ||
| - Full unit suite: **367 tests, 2160 assertions — OK**. Pint PSR-12 ✅, PHPStan level 7 ✅. | ||
|
|
||
| ## Notes / open questions | ||
|
|
||
| - The `:__ver` marker relies on `$updatedAt` advancing per write; with `preserveDates` enabled or sub-microsecond updates two versions could tie (comparison is strict `<`, so a tie keeps today's behaviour). Could switch to a monotonic counter if preferred. | ||
| - Marker reads use `self::TTL`, so the marker expires in lockstep with the document cache entry it guards. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -174,6 +174,10 @@ class Database | |
| // Cache | ||
| public const TTL = 60 * 60 * 24; // 24 hours | ||
|
|
||
| // Suffix for the per-document cache key that records the latest committed | ||
| // version, used to reject stale snapshots re-cached by racing readers. | ||
| public const CACHE_VERSION_SUFFIX = '__ver'; | ||
|
|
||
| // Events | ||
| public const EVENT_ALL = '*'; | ||
|
|
||
|
|
@@ -4852,6 +4856,28 @@ public function getDocument(string $collection, string $id, array $queries = [], | |
| } | ||
| } | ||
|
|
||
| // Reject a cached snapshot that predates the latest committed write. Cache | ||
| // invalidation is a purge, which is not atomic with the committing | ||
| // transaction: a concurrent reader can re-cache the pre-commit row after a | ||
| // writer's purge, leaving a stale entry that later reads would serve. Each | ||
| // write records the committed $updatedAt in a sibling marker key; discard | ||
| // any cached copy older than it so the stale entry cannot be served. | ||
| if ($cached !== null && isset($cached['$updatedAt'])) { | ||
| $marker = null; | ||
| try { | ||
| $marker = $this->cache->load($documentKey . ':' . self::CACHE_VERSION_SUFFIX, self::TTL); | ||
| } catch (Exception $e) { | ||
| Console::warning('Warning: Failed to get cache version marker: ' . $e->getMessage()); | ||
| } | ||
|
|
||
| if (\is_string($marker) && $marker !== '') { | ||
| $cachedVersion = $this->cacheVersionStamp($cached['$updatedAt']); | ||
| if ($cachedVersion !== null && (float) $cachedVersion < (float) $marker) { | ||
| $cached = null; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if ($cached) { | ||
| $document = $this->createDocumentInstance($collection->getId(), $cached); | ||
|
|
||
|
|
@@ -6386,6 +6412,9 @@ public function updateDocument(string $collection, string $id, Document $documen | |
|
|
||
| // Purge again after commit so readers cannot re-cache the pre-commit version | ||
| $this->purgeCachedDocumentInternal($collection->getId(), $id); | ||
| // Record the committed version so a stale snapshot re-cached by a racing | ||
| // reader is rejected on subsequent reads (see getDocument). | ||
| $this->recordCachedDocumentVersion($collection->getId(), $id, $document->getUpdatedAt()); | ||
|
Comment on lines
6414
to
+6417
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When
Comment on lines
+6415
to
+6417
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Apply the version marker to every document write path. This only covers Also applies to: 7807-7809 🤖 Prompt for AI Agents |
||
|
|
||
| if (!$this->inBatchRelationshipPopulation && $this->resolveRelationships) { | ||
| $documents = $this->silent(fn () => $this->populateDocumentsRelationships([$document], $collection, $this->relationshipFetchDepth)); | ||
|
|
@@ -7775,6 +7804,9 @@ public function deleteDocument(string $collection, string $id): bool | |
| if ($deleted) { | ||
| // Purge again after commit so readers cannot re-cache the pre-commit version | ||
| $this->purgeCachedDocumentInternal($collection->getId(), $id); | ||
| // Advance the version marker past any snapshot so a stale "exists" copy | ||
| // re-cached by a racing reader is rejected on subsequent reads. | ||
| $this->recordCachedDocumentVersion($collection->getId(), $id, DateTime::now()); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The delete path records |
||
| $this->trigger(self::EVENT_DOCUMENT_DELETE, $document); | ||
| } | ||
|
|
||
|
|
@@ -8380,6 +8412,70 @@ protected function purgeCachedDocumentInternal(string $collectionId, ?string $id | |
| return true; | ||
| } | ||
|
|
||
| /** | ||
| * Record the latest committed version of a document in the cache. | ||
| * | ||
| * Stored in a sibling key (CACHE_VERSION_SUFFIX) that purgeCachedDocument() | ||
| * does not delete, so it outlives the purge of the document body. getDocument() | ||
| * compares a cached snapshot's $updatedAt against this marker and discards any | ||
| * snapshot older than it — closing the window where a reader re-caches a | ||
| * pre-commit row after a writer's purge. | ||
| * | ||
| * @param string $collectionId | ||
| * @param string $id | ||
| * @param string|\Utopia\Database\DateTime|null $updatedAt committed $updatedAt | ||
| * @return void | ||
| */ | ||
| protected function recordCachedDocumentVersion(string $collectionId, string $id, mixed $updatedAt): void | ||
| { | ||
| $stamp = $this->cacheVersionStamp($updatedAt); | ||
| if ($stamp === null) { | ||
| return; | ||
| } | ||
|
|
||
| [, $documentKey] = $this->getCacheKeys($collectionId, $id); | ||
|
|
||
| try { | ||
| $this->cache->save($documentKey . ':' . self::CACHE_VERSION_SUFFIX, $stamp); | ||
| } catch (Exception $e) { | ||
| Console::warning('Warning: Failed to record cache version marker: ' . $e->getMessage()); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Normalise a document $updatedAt value (string, DateTime, or Mongo UTCDateTime) | ||
| * into a comparable microsecond-precision UNIX timestamp string, or null when | ||
| * it cannot be determined (in which case the version guard is skipped). | ||
| * | ||
| * @param mixed $updatedAt | ||
| * @return string|null | ||
| */ | ||
| private function cacheVersionStamp(mixed $updatedAt): ?string | ||
| { | ||
| if (empty($updatedAt)) { | ||
| return null; | ||
| } | ||
|
|
||
| try { | ||
| if ($updatedAt instanceof \DateTimeInterface) { | ||
| return $updatedAt->format('U.u'); | ||
| } | ||
|
|
||
| // MongoDB\BSON\UTCDateTime and similar value objects. | ||
| if (\is_object($updatedAt) && \method_exists($updatedAt, 'toDateTime')) { | ||
| return $updatedAt->toDateTime()->format('U.u'); | ||
| } | ||
|
|
||
| if (\is_string($updatedAt)) { | ||
| return (new \DateTime($updatedAt))->format('U.u'); | ||
| } | ||
| } catch (\Throwable $e) { | ||
| return null; | ||
| } | ||
|
|
||
| return null; | ||
| } | ||
|
|
||
| /** | ||
| * Cleans a specific document from cache and triggers EVENT_DOCUMENT_PURGE. | ||
| * And related document reference in the collection cache. | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the cache marker strictly dominate stale snapshots.
The guard only rejects
cachedVersion < marker, but update writes can preserve/tie$updatedAt, and deletes useDateTime::now()which can be older than a preserved/future$updatedAt. In those cases, a stale re-cached snapshot is not rejected. Use a truly monotonic cache-only version token, or otherwise guarantee the marker is strictly newer than any pre-write snapshot before relying on this comparison.Also applies to: 6415-6417, 7807-7809
🤖 Prompt for AI Agents