Add skipDuplicates() scope guard to createDocuments #852
premtsd-code wants to merge 26 commits into main from
Conversation
📝 Walkthrough
Adds a scoped
Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Database
participant Adapter
participant Storage as DB
rect rgba(100,180,120,0.5)
Note over Client,Storage: New flow with skipDuplicates enabled
Client->>Database: createDocuments(docs) within skipDuplicates scope
Database->>Database: compute tenantKey, dedupe, chunk ids
Database->>Adapter: find(existing ids) [chunked]
Adapter->>DB: SELECT/_find by _uid/_id
DB-->>Adapter: existing ids
Adapter-->>Database: existing ids
Database->>Database: filter out pre-existing docs, prepare inserts
Database->>Adapter: adapter->skipDuplicates(fn => createDocuments(filtered))
Adapter->>DB: INSERT IGNORE / ON CONFLICT DO NOTHING or conditional upserts (Mongo)
DB-->>Adapter: success (duplicates skipped)
Adapter-->>Database: inserted result info
Database->>Database: create relationships for actually inserted docs
Database-->>Client: return inserted count / docs
end
sequenceDiagram
participant Client
participant Database
participant Adapter
participant Storage as DB
rect rgba(200,100,150,0.5)
Note over Client,Storage: Regular flow without skipDuplicates
Client->>Database: createDocuments(docs)
Database->>Adapter: createDocuments(docs)
Adapter->>DB: INSERT INTO ...
DB-->>Adapter: may error on duplicates
Adapter-->>Database: propagate exception or handle normally
Database-->>Client: result or exception
end
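The behavioral difference between the two flows above can be reproduced in isolation. Below is an illustrative sketch in Python using SQLite's INSERT OR IGNORE (one of the adapter strategies named in the diagram), not the library's actual code:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (_uid TEXT PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO docs VALUES ('a', 'first')")

# Regular flow: a duplicate _uid raises an error for the statement.
try:
    con.execute("INSERT INTO docs VALUES ('a', 'second')")
except sqlite3.IntegrityError:
    pass  # propagated to the caller as a duplicate exception

# skipDuplicates flow: the conflicting row is silently skipped,
# while non-conflicting rows in the same batch still insert.
con.executemany(
    "INSERT OR IGNORE INTO docs VALUES (?, ?)",
    [("a", "second"), ("b", "new")],
)
rows = con.execute("SELECT _uid, name FROM docs ORDER BY _uid").fetchall()
# 'a' keeps its original value; only 'b' was inserted
```

The same contrast holds for MariaDB's INSERT IGNORE and Postgres's ON CONFLICT DO NOTHING, modulo each engine's exact error semantics.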
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Greptile Summary
This PR adds a

Confidence Score: 3/5
One P1 defect (M2M junction duplication on re-run) should be resolved before merging for any re-import use cases involving MANY_TO_MANY relationships. The core mechanism — adapter flags, Pool propagation, Mongo transaction bypass, Mirror divergence prevention — is well-designed and tests are thorough for covered cases. However, src/Database/Database.php (createDocuments pre-processing loop around line 5720), src/Database/Mirror.php (skipDuplicates pre-filter around line 614)

Important Files Changed
Actionable comments posted: 4
🧹 Nitpick comments (1)
tests/e2e/Adapter/Scopes/DocumentTests.php (1)
7815-7818: Prefer __FUNCTION__ for collection naming consistency. Using __FUNCTION__ here (as in adjacent tests) improves isolation and avoids accidental name reuse across reruns.

♻️ Suggested tweak

- $col = 'createDocsIgnoreIntraBatch';
+ $col = __FUNCTION__;

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/Adapter/Scopes/DocumentTests.php` around lines 7815 - 7818, The collection name is hard-coded as 'createDocsIgnoreIntraBatch' which can collide across test runs; change the $col assignment to use the current function name (replace the literal with __FUNCTION__) so the calls to $database->createCollection($col) and $database->createAttribute($col, 'name', Database::VAR_STRING, 128, true) use a unique, test-local collection name consistent with adjacent tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/Database/Adapter/Mongo.php`:
- Around line 1501-1524: The current dedupe prefetch uses only _uid and a single
tenant filter, causing cross-tenant collisions for shared tables; update the
logic that builds existingUids (and the prefetch loop around
$this->client->find/$this->client->getMore) to be tenant-aware by using a
composite key (e.g., "$tenant:$uid") when populating $existingUids, include the
tenant in the $filters returned by $this->getTenantFilters($collection->getId())
per chunk, and prefetch in per-tenant chunks using array_chunk($uidValues,
max(1, $this->maxQueryValues)) (use the chunk size when calling getMore instead
of count($uidValues)); then when filtering $records check the composite key
($record['_tenant'] . ':' . ($record['_uid'] ?? '')) against $existingUids so
duplicate detection respects tenant boundaries.
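The tenant-aware, chunked prefetch this comment asks for can be sketched independently of the Mongo driver. The following is a hypothetical Python sketch: find_existing, max_query_values, and the "tenant:uid" key shape are illustrative stand-ins, not the adapter's real API.

```python
def prefetch_existing(uids_by_tenant, find_existing, max_query_values=100):
    """Look up existing IDs per tenant, in chunks, and return composite keys."""
    existing = set()
    size = max(1, max_query_values)  # mirrors max(1, $this->maxQueryValues)
    for tenant, uids in uids_by_tenant.items():
        for i in range(0, len(uids), size):
            chunk = uids[i:i + size]
            # find_existing stands in for the driver query filtered by tenant
            for uid in find_existing(tenant, chunk):
                existing.add(f"{tenant}:{uid}")
    return existing

# Duplicate detection then respects tenant boundaries:
store = {("t1", "a"), ("t2", "b")}
found = prefetch_existing(
    {"t1": ["a", "b"], "t2": ["a", "b"]},
    lambda tenant, chunk: [u for u in chunk if (tenant, u) in store],
    max_query_values=1,
)
# 't1:a' and 't2:b' exist; 't1:b' and 't2:a' do not collide across tenants
```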
- Around line 1504-1516: The prefetch block calls $this->client->find() and
$this->client->getMore() without exception mapping; wrap each driver call in a
try/catch that catches \Throwable and rethrows the normalized exception via
processException($e) (same pattern used elsewhere in this class), i.e. surround
the $this->client->find(...) call and the $this->client->getMore(...) call with
try { ... } catch (\Throwable $e) { throw $this->processException($e); } so any
raw driver errors from the $response/$more handling are mapped consistently.
In `@src/Database/Database.php`:
- Around line 7210-7255: The upsertDocumentsWithIncrease() flow builds
$existingDocs keyed as tenant:id when $upsertTenantPerDocument is true, but
later duplicate-input detection still uses only $document->getId(), causing
false DuplicateException across tenants; update the duplicate-input check and
any intermediate maps (e.g., $seenIds/$inputKeys) to use the same composite
lookup key (use the existing $lookupKey logic: $tenant . ':' .
$document->getId() when $upsertTenantPerDocument is true) so the tenant-aware
key is reused consistently across the whole upsert path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c9f1924e-3ed0-43a8-8859-6e6c299aadb5
📒 Files selected for processing (9)
src/Database/Adapter.php, src/Database/Adapter/Mongo.php, src/Database/Adapter/Pool.php, src/Database/Adapter/Postgres.php, src/Database/Adapter/SQL.php, src/Database/Adapter/SQLite.php, src/Database/Database.php, src/Database/Mirror.php, tests/e2e/Adapter/Scopes/DocumentTests.php
Actionable comments posted: 3
♻️ Duplicate comments (2)
src/Database/Database.php (2)
5682-5726: ⚠️ Potential issue | 🟠 Major: Don't let the outer prefetch snapshot suppress writes.
The adapters already do the authoritative duplicate filtering inside the transaction/session, but this code still drops $chunk entries from a pre-transaction find(). If a matching row is deleted after that snapshot but before the insert transaction starts, ignore: true will incorrectly skip a document that should now insert, and $modified/$onNext will be low. Also applies to: 5805-5815
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/Database/Database.php` around lines 5682 - 5726, The current prefetch logic builds $preExistingIds (when $ignore is true, in the tenantPerDocument and non-tenant branches using find()/authorization->skip/etc.) and later uses that snapshot to skip writes, which can wrongly suppress inserts if a matching row was deleted after the prefetch; remove or stop using this prefetch-based suppression. Instead, eliminate reliance on $preExistingIds to skip document inserts/relationship writes (and the related checks that reference it around the relationship-resolution code and at the later block noted ~5805-5815), and either rely on the adapters’ authoritative duplicate filtering inside the transaction/session or re-check existence inside the transaction. Keep the prefetch only if it’s used for optimizations that do not affect whether an insert occurs, or remove the prefetch entirely if it’s unused for non-authoritative optimizations.
5788-5790: ⚠️ Potential issue | 🟠 Major: Use the same tenant-aware key for deferred relationship writes.
In tenant-per-document mode, two inserted documents can legitimately share the same $id. This map is keyed only by $id, so the later payload overwrites the earlier one, and unset() after the first insert prevents the second tenant's relationship write entirely.

♻️ Proposed fix

 if (!empty($relationshipData)) {
-    $deferredRelationships[$document->getId()] = $relationshipData;
+    $relationshipKey = $tenantPerDocument
+        ? $document->getTenant() . ':' . $document->getId()
+        : $document->getId();
+    $deferredRelationships[$relationshipKey] = $relationshipData;
 }
@@
 foreach ($batch as $insertedDoc) {
-    $docId = $insertedDoc->getId();
-    if (\array_key_exists($docId, $deferredRelationships)) {
+    $relationshipKey = $tenantPerDocument
+        ? $insertedDoc->getTenant() . ':' . $insertedDoc->getId()
+        : $insertedDoc->getId();
+    if (\array_key_exists($relationshipKey, $deferredRelationships)) {
         $relDoc = clone $insertedDoc;
-        foreach ($deferredRelationships[$docId] as $key => $value) {
+        foreach ($deferredRelationships[$relationshipKey] as $key => $value) {
             $relDoc->setAttribute($key, $value);
         }
         $this->silent(fn () => $this->createDocumentRelationships($collection, $relDoc));
-        unset($deferredRelationships[$docId]);
+        unset($deferredRelationships[$relationshipKey]);
     }
 }

Also applies to: 5823-5832
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/Database/Database.php` around lines 5788 - 5790, The deferred relationship map ($deferredRelationships) is keyed only by $document->getId(), which causes collisions in tenant-per-document mode; change the key to be tenant-aware (e.g., combine tenant id and document id or use the document's tenant identifier method) wherever entries are set and removed so each tenant/document pair is unique—update the code that assigns $deferredRelationships[$document->getId()] = $relationshipData; and the matching unset() logic (and the same pattern around the related block at lines referenced 5823-5832) to use the composite tenant-aware key instead of just getId().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/Database/Adapter/SQL.php`:
- Around line 2482-2527: The pre-filter SELECT that checks for existing _uid
rows (the block building $sql, preparing via $this->getPDO()->prepare, binding
values and calling $stmt->execute()/fetchAll()) can throw a raw PDOException
because it runs outside the method's main try/catch; wrap the entire
prepare/execute/fetch block in the adapter's processException(...) wrapper so
exceptions are mapped consistently (i.e., call $this->processException(...)
around the closure that prepares, binds, executes and fetches, returning $rows),
keeping the logic inside createDocuments() and referencing the existing $sql,
$binds, $stmt and $rows variables.
- Around line 2665-2713: The reconciliation currently keys timestamps only by
_uid and uses a single :_vfy_tenant bind when $this->sharedTables is true, which
corrupts mixed-tenant batches; update the verification to be tenant-aware by
binding the tenant per document and keying by a composite tenant+uid.
Concretely, in the block that builds $verifyPlaceholders/$verifyBinds and
$verifySql (the code using $verifyPlaceholders, $verifyBinds, $verifySql and
$verifyStmt), include the tenant value for each document (use the document's
tenant when tenant-per-document applies), change the WHERE to match pairs (e.g.
use row-value tuples or explicit (_uid = :uX AND _tenant = :tX) clauses), and
then build $actualTimestamps keyed by "$tenant|$uid" (and compare against
$expectedTimestamps keyed the same way) so the subsequent loop that populates
$insertedDocs/$insertedUids only matches rows with the correct tenant+uid.
- Around line 2650-2652: Replace the use of "INSERT IGNORE" (returned by
getInsertKeyword) with a plain "INSERT" and remove any IGNORE-specific behavior
in getInsertSuffix so that constraint violations are not globally suppressed;
instead let the DB raise errors and update the insert/execute path that catches
SQL exceptions to inspect the database error/constraint name (e.g., index or key
name such as _uid) and map errors to DuplicateException only when the violated
constraint corresponds to the _uid duplicate, otherwise map to UniqueException
(or other specific exceptions) for PRIMARY/UNIQUE/NOT NULL/truncation
violations—use the existing functions getInsertKeyword and getInsertSuffix to
locate changes and adjust the exception-mapping in the code that performs the
insert and error handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d2ab610b-bb64-4dcb-979c-f3880d9e99df
📒 Files selected for processing (3)
src/Database/Adapter/Mongo.php, src/Database/Adapter/SQL.php, src/Database/Database.php
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Database/Adapter/Mongo.php
♻️ Duplicate comments (3)
src/Database/Adapter/SQL.php (3)
2482-2527: ⚠️ Potential issue | 🟠 Major: Wrap the ignore-mode pre-filter in processException() handling.
This query runs before the method's main try/catch, so SQL or connection failures here still escape as raw PDOException while the rest of createDocuments() is adapter-mapped.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/Database/Adapter/SQL.php` around lines 2482 - 2527, The pre-filter block in createDocuments() that prepares and executes the SELECT (using getSQLTable(), getPDO(), getPDOType(), preparing $sql and executing $stmt) should be wrapped inside a call to processException() so any PDOException is adapter-mapped instead of escaping; move the existing logic that builds $placeholders/$binds, $tenantFilter, $sql, $stmt->bindValue(), $stmt->execute(), $rows = $stmt->fetchAll() and $stmt->closeCursor() into a closure passed to processException(), and return/assign the result from that closure (or rethrow via processException) so behavior matches the rest of createDocuments().
2782-2785:⚠️ Potential issue | 🟠 MajorDon’t use
INSERT IGNOREas the base implementation forignore=true.For adapters that inherit this default,
ignore=truestops meaning “skip duplicate document IDs” and starts suppressing a broader class of insert failures on both the document and permission insert paths. The base hook should stay plainINSERT INTO, and adapter-specific duplicate handling should be expressed with targeted conflict clauses or constraint-aware exception mapping.Does MariaDB/MySQL `INSERT IGNORE` suppress constraint violations beyond duplicate-key conflicts, such as PRIMARY KEY, other UNIQUE, NOT NULL, truncation, or conversion errors?Based on learnings, in
src/Database/Adapter/MariaDB.phponly duplicate_uidviolations should throwDuplicateException; other unique orPRIMARYcollisions should still throwUniqueException.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/Database/Adapter/SQL.php` around lines 2782 - 2785, The base method getInsertKeyword in class SQL should not return 'INSERT IGNORE INTO' for ignore=true; change it to always return the plain 'INSERT INTO' and move any ignore semantics into adapter-specific code (e.g., Adapter/MariaDB.php) by using targeted conflict clauses or by mapping constraint/SQLSTATE errors to the correct exceptions; specifically, in MariaDB adapter ensure only duplicate violations on the document _uid map to DuplicateException while other UNIQUE/PRIMARY/constraint failures map to UniqueException (and do not rely on INSERT IGNORE in SQL::getInsertKeyword).
2664-2728: ⚠️ Potential issue | 🔴 Critical: Keep expectedTimestamps tenant-aware too.
$actualTimestamps is keyed by tenant:uid now, but $expectedTimestamps is still keyed only by _uid. In tenantPerDocument mode, two documents with the same _uid in different tenants can overwrite each other here, causing an inserted document to be dropped from $documents and miss its permission insert.

💡 Proposed fix

-$expectedTimestamps = [];
-foreach ($documents as $doc) {
-    $expectedTimestamps[$doc->getId()] = $doc->getCreatedAt();
-}
+$expectedTimestamps = [];
+foreach ($documents as $doc) {
+    $key = ($this->sharedTables && $this->tenantPerDocument)
+        ? $doc->getTenant() . ':' . $doc->getId()
+        : $doc->getId();
+    $expectedTimestamps[$key] = $doc->getCreatedAt();
+}
@@
-if (isset($actualTimestamps[$key]) && $actualTimestamps[$key] === $expectedTimestamps[$doc->getId()]) {
+if (isset($actualTimestamps[$key]) && $actualTimestamps[$key] === $expectedTimestamps[$key]) {
     $insertedDocs[] = $doc;
 }
Verify each finding against the current code and only fix it if needed. In `@src/Database/Adapter/SQL.php` around lines 2664 - 2728, The expectedTimestamps map is not tenant-aware while actualTimestamps is, so when sharedTables && tenantPerDocument you must key expectedTimestamps the same way as actualTimestamps (tenant + ':' + uid) during the initial foreach over $documents and also use that key when comparing later (instead of $expectedTimestamps[$doc->getId()]); update the loop that builds $expectedTimestamps and the comparison that checks equality to construct/look up keys as ($doc->getTenant() . ':' . $doc->getId()) when $this->sharedTables && $this->tenantPerDocument so entries don't collide across tenants.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 892638bd-ac20-414b-9059-7f7e20295028
📒 Files selected for processing (2)
src/Database/Adapter/SQL.php, src/Database/Database.php
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Database/Database.php
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/Database/Adapter/SQL.php`:
- Around line 2676-2682: The verification query is binding composite keys from
$expectedTimestamps into $verifyBinds/placeholders and then comparing them
against the _uid column, but when tenantPerDocument is enabled _uid was set to
the raw document ID ($attributes['_uid'] = $document->getId()), so the WHERE
_uid IN (...) must use the document IDs not the composite keys. Fix by
extracting the document ID portion from each composite key before creating $ph
and adding to $verifyBinds (or if keys are already plain IDs, leave unchanged);
update the loop that builds $verifyPlaceholders/$verifyBinds (the foreach over
array_values(array_unique(array_keys($expectedTimestamps)))) to derive $uid =
<documentIdFromComposite($uid)> and bind that instead so the verification query
matches rows by actual _uid values.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 43af96a2-d479-4354-86c9-7d68abad7b56
📒 Files selected for processing (1)
src/Database/Adapter/SQL.php
Adapters:
- MariaDB/MySQL: INSERT IGNORE INTO
- PostgreSQL: ON CONFLICT DO NOTHING
- SQLite: INSERT OR IGNORE INTO
- MongoDB: session-scoped pre-filter before insertMany

Database.php:
- Intra-batch dedup by ID (tenant-aware, first occurrence wins)
- Pre-fetch existing IDs to skip known duplicates
- Deferred relationship creation for ignore mode (no orphans)
- Race-condition reconciliation via _createdAt timestamp verification

upsertDocuments:
- Batch-fetch existing docs with find() instead of per-row getDocument()
- Tenant-aware composite keys for seenIds duplicate check

All paths are tenant-per-document aware with null-safe array_filter.
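The intra-batch dedup described above (tenant-aware, first occurrence wins, documents without an explicit ID pass through) can be sketched as a standalone function. This is an illustrative Python translation of that logic, not the PHP source:

```python
def dedupe_first_wins(docs, tenant_per_document=False):
    """Keep the first occurrence of each ID (or tenant/ID pair); auto-ID docs pass through."""
    seen = set()
    out = []
    for doc in docs:
        if doc["id"] != "":  # documents without an explicit ID are never duplicates
            key = (doc.get("tenant"), doc["id"]) if tenant_per_document else doc["id"]
            if key in seen:
                continue  # later occurrence loses
            seen.add(key)
        out.append(doc)
    return out

batch = [
    {"id": "a", "tenant": 1, "v": "first"},
    {"id": "a", "tenant": 1, "v": "second"},  # dropped: same tenant + id
    {"id": "a", "tenant": 2, "v": "other"},   # kept: different tenant
]
kept = dedupe_first_wins(batch, tenant_per_document=True)
# two entries survive: (1, 'a') with its first value, and (2, 'a')
```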
- testCreateDocumentsIgnoreDuplicates: mixed batch with onNext assertions
- testCreateDocumentsIgnoreIntraBatchDuplicates: first-wins, no ACL drift
- testCreateDocumentsIgnoreAllDuplicates: zero inserts, empty onNext
…x _uid mapping
- Pass ignoreDuplicates option to insertMany for race-condition safety
- Extract _uid from raw array before replaceChars (excluded from transformation)
- Map inserted records back to original Document objects by _uid
…t and fix _uid mapping"

This reverts commit 93a9136.
- Add skipDuplicates(callable) scope guard on Database, following
existing pattern (skipRelationships, skipValidation, etc.)
- Remove bool $ignore parameter from createDocuments signature
- Mirror propagates skipDuplicates state to source and destination
- Update tests to use $database->skipDuplicates(function() { ... })
- Add $skipDuplicates property + skipDuplicates() scope guard to Adapter
- Remove bool $ignore parameter from all adapter createDocuments signatures (Adapter, Pool, Mongo, SQL)
- Remove bool $ignore from SQL helper methods (getInsertKeyword, getInsertSuffix, getInsertPermissionsSuffix) and Postgres/SQLite overrides
- Pool delegate() and withTransaction() propagate skipDuplicates state to pooled/pinned adapter via the scope guard
- Database::createDocuments() wraps adapter call in adapter->skipDuplicates() when the flag is set, drops the local $ignore variable
- skipDuplicates() takes optional $enable so Pool/Mirror/Database can forward state in a single call instead of branching
- Database: tenantKey(Document) helper replaces ~8 inline ternaries
- Database: fetchExistingByIds() helper replaces the tenant-grouped batched find() in createDocuments and upsertDocuments pre-fetch
- Database: drop dead guard on non-defer relationship path
- Mongo: findExistingUids() helper replaces the duplicated cursor walk in the skipDuplicates pre-filter (tenant and non-tenant variants)
- SQL: buildUidTenantLookup() helper replaces the duplicated placeholder-and-binds builder shared by pre-filter and reconciliation
When a batch has only null-ID documents but non-null tenants, the previous code emitted WHERE _uid IN () AND _tenant IN (...) — invalid SQL that the !empty(binds) call-site guard failed to catch because the tenant binds kept the binds array non-empty. Return 'where' => '1=0' with empty binds when there are no UIDs, so the query matches nothing and leaves no dangling placeholders.
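The guard described in that commit (return a never-matching predicate instead of emitting WHERE _uid IN ()) can be sketched as follows. This is a hypothetical Python sketch that only mirrors the guard; the real buildUidTenantLookup helper lives in the SQL adapter:

```python
def build_uid_tenant_lookup(uids, tenants):
    """Build a WHERE fragment plus bind values; never emit an empty IN ()."""
    if not uids:
        # No UIDs means nothing can match; '1=0' avoids invalid SQL even
        # when tenant binds alone would keep the binds array non-empty.
        return {"where": "1=0", "binds": []}
    where = "_uid IN ({})".format(", ".join("?" * len(uids)))
    binds = list(uids)
    if tenants:
        where += " AND _tenant IN ({})".format(", ".join("?" * len(tenants)))
        binds += list(tenants)
    return {"where": where, "binds": binds}
```

With this shape, the call site's "is binds empty" check and the SQL text can never disagree: an all-null-ID batch produces a query that matches nothing and carries no dangling placeholders.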
- testCreateDocumentsSkipDuplicatesEmptyBatch: no-op with empty array
- testCreateDocumentsSkipDuplicatesNestedScope: scope guard save/restore through nested skipDuplicates() calls
- testCreateDocumentsSkipDuplicatesLargeBatch: 300-doc batch with 50 pre-existing, exercising the chunk loop
- testCreateDocumentsSkipDuplicatesSecondCallSkipsAll: second identical batch should preserve First values, not overwrite with Second
- testCreateDocumentsSkipDuplicatesRelationships: skipDuplicates with a one-to-many relationship — verifies deferred relationship logic does not create orphan child rows for ignored parents
INSERT IGNORE / ON CONFLICT DO NOTHING / INSERT OR IGNORE handle duplicates at the DB level, and the existing reconciliation block identifies actually-inserted rows via the _createdAt timestamp. The pre-filter was a wire-data optimization that ran a SELECT before the INSERT, then trimmed the batch — but the same SELECT (with timestamp comparison) already happens in the reconciliation path on demand. Net: ~45 lines removed, single code path for skipDuplicates instead of pre-filter + fallback. Race-condition handling is unchanged (reconciliation path covers it).
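The _createdAt reconciliation mentioned above can be sketched in isolation: rows the batch actually inserted carry the batch's timestamp, while rows skipped by INSERT IGNORE keep their older one. Illustrative Python only; the names are assumptions, not the adapter's API:

```python
def reconcile_inserted(batch, actual_created_at):
    """Return only the documents whose stored _createdAt matches what this batch wrote."""
    inserted = []
    for doc in batch:
        # a skipped duplicate keeps the pre-existing row's timestamp,
        # so the comparison filters it out
        if actual_created_at.get(doc["id"]) == doc["created_at"]:
            inserted.append(doc)
    return inserted

batch = [
    {"id": "a", "created_at": "2024-01-02T00:00:00Z"},  # pre-existing, skipped
    {"id": "b", "created_at": "2024-01-02T00:00:00Z"},  # freshly inserted
]
stored = {"a": "2023-12-31T00:00:00Z", "b": "2024-01-02T00:00:00Z"}
result = reconcile_inserted(batch, stored)
# only 'b' is reported as inserted
```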
Previously the Database orchestrator carried a pre-fetch + chunk-filter
pass to compensate for Mongo's plain upsert() returning matched +
upserted combined ('n') — without that compensation, $modified would
over-count the number of pre-existing docs in the batch.
This pushes the responsibility into the adapter, where it belongs:
- composer bump to utopia-php/mongo dev-upsert-return-upserted-count
(PR #37, adds Client::upsertWithCounts() returning matched/modified/
upserted[] separately)
- Mongo adapter calls upsertWithCounts() and maps the response's
upserted[].index entries back to the originating Document objects,
returning ONLY the actually-inserted docs (race-safe)
- Database::createDocuments drops $preExistingIds pre-fetch and the
chunk filter; the adapter is now responsible for accurate counts
Net effect on Database.php: ~17 lines removed. Net effect on Mongo.php:
~80 lines net deletion (replaced the inline pre-filter with a single
upsertWithCounts call + index mapping).
This reverts commit c6d566e.
The SQL adapter previously carried a pre-filter SELECT plus a _createdAt reconciliation pass to compensate for INSERT IGNORE not telling us which rows were actually inserted vs skipped. The Mongo adapter pulled the same trick via Client::upsertWithCounts() (mongo PR #37). Both layers were solving the same problem twice. Move the pre-filter back to Database::createDocuments where it ran originally, and let each adapter's INSERT IGNORE / ON CONFLICT DO NOTHING / upsert+\$setOnInsert serve only as a race-safety net. The adapter just returns its input as the inserted set. - Database.php: restore \$preExistingIds pre-fetch + chunk filter - SQL.php: drop pre-filter, drop _createdAt reconciliation, drop buildUidTenantLookup helper (-173 lines) - Mongo.php: revert to plain Client::upsert(), return input docs (withTransaction skipDuplicates bypass stays — still needed to avoid WriteConflict under snapshot isolation) - composer.json/lock: back to stable utopia-php/mongo 1.* Trade-off: in the race window between the orchestrator's pre-fetch SELECT and the adapter INSERT, \$modified over-reports by N and onNext fires for N skipped docs. Common case stays correct.
…en comments The deferred-relationships mechanism was useful when the adapter could report which docs were actually inserted (via reconciliation/ upsertWithCounts). With those gone, $batch always equals $chunk, so the defer-then-match dance is dead weight. The orchestrator already pre- filters known duplicates upfront, so any survivor is expected to insert and relationships can be created inline like the non-skipDuplicates path. - Database.php: pre-filter $documents once, before encode/validate, instead of filtering each chunk inline. Drop $deferredRelationships scaffolding and the per-chunk match loop. Collapse if/elseif into one inline createDocumentRelationships call. - Mongo.php: tighten withTransaction skipDuplicates comment (the "relationship writes are deferred" rationale is no longer accurate). - Mirror.php: tighten the captureOnNext comment. - composer.lock: revert unrelated symfony-polyfill-php85 / mongo 1.0.0→ 1.0.2 bumps that crept in from an earlier composer update. Race window unchanged: a doc that passes pre-filter but loses to a concurrent insert still produces an orphan relationship and an over-counted $modified — same as before this commit.
```php
 * @param bool $enable
 * @return T
 */
public function skipDuplicates(callable $callback, bool $enable = true): mixed
```
Let's remove the $enable flag and check the skipDuplicates state at the call site instead; otherwise we add the overhead of an extra callable invocation even when it's false.
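The call-site branch the reviewer is asking for might look like this. A sketch only: the closure shape and variable names are assumptions drawn from the surrounding diff, not the merged code.

```php
// Before: the guard always allocated a closure and an extra call,
// even when $this->skipDuplicates was false.
//
// After: branch inline so only the skipDuplicates path pays for the guard.
$insert = fn () => $this->adapter->createDocuments($collection, $chunk);

$batch = $this->skipDuplicates
    ? $this->adapter->skipDuplicates($insert)  // guard wraps only when active
    : $insert();                               // common path: direct call
```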
```php
 * @param bool $idsOnly When true, uses Query::select(['$id']) for a lighter fetch
 * @return array<string, Document>
 */
private function fetchExistingByIds(Document $collection, array $documents, bool $idsOnly = false): array
```
```php
// skipDuplicates: dedupe intra-batch (first wins) then drop already-existing rows.
// Adapter INSERT IGNORE / ON CONFLICT / upsert is the race-safety net.
if ($this->skipDuplicates) {
    $seenIds = [];
    $deduplicated = [];
    foreach ($documents as $document) {
        if ($document->getId() !== '') {
            $dedupeKey = $this->tenantKey($document);
            if (isset($seenIds[$dedupeKey])) {
                continue;
            }
            $seenIds[$dedupeKey] = true;
        }
        $deduplicated[] = $document;
    }
    $documents = $deduplicated;

    $preExistingIds = $this->fetchExistingByIds($collection, $documents, idsOnly: true);
    if (!empty($preExistingIds)) {
        $documents = \array_values(\array_filter(
            $documents,
            fn (Document $doc) => !isset($preExistingIds[$this->tenantKey($doc)])
        ));
    }
}
```
I don't think we need to do this if we're using INSERT IGNORE
@abnegate The child collection gets orphan rows when multiple input rows share the same parent but have different children: the parent insert is ignored as a duplicate, but the child inserts are not.
database/src/Database/Database.php
Lines 5765 to 5767 in 41704eb
```php
$insertedFromSource = [];
$captureOnNext = function (Document $doc) use (&$insertedFromSource, $onNext): void {
    $insertedFromSource[] = $doc;
    if ($onNext !== null) {
        $onNext($doc);
    }
};
```
We can never capture like this internally; if there were many docs, this accumulation could easily OOM.
```php
/**
 * Returns the INSERT keyword, optionally with IGNORE for duplicate handling.
 * Override in adapter subclasses for DB-specific syntax.
 */
protected function getInsertKeyword(): string
{
    return $this->skipDuplicates ? 'INSERT IGNORE INTO' : 'INSERT INTO';
}

/**
 * Returns a suffix appended after VALUES clause for duplicate handling.
 * Override in adapter subclasses (e.g., Postgres uses ON CONFLICT DO NOTHING).
 */
protected function getInsertSuffix(string $table): string
{
    return '';
}

/**
 * Returns a suffix for the permissions INSERT statement when ignoring duplicates.
 * Override in adapter subclasses for DB-specific syntax.
 */
protected function getInsertPermissionsSuffix(): string
{
    return '';
}
```
Let's group logically with other getter methods
#1 Drop $enable flag on skipDuplicates() scope guard

The $enable param made every non-skipDuplicates createDocuments call pay for a closure allocation plus an extra function call. Branch at the call site instead, so the cost only applies when the flag is actually set.

- Adapter::skipDuplicates(callable, bool) → skipDuplicates(callable)
- Database::skipDuplicates(callable, bool) → skipDuplicates(callable)
- Database::createDocuments, Mirror::createDocuments, Pool::delegate, Pool::withTransaction now branch inline.

#2 Drop fetchExistingByIds helper, inline find()

The helper's per-tenant grouping defended a hypothetical multi-tenant batching scenario that no caller exercises (relationships are intra-tenant, callers always batch per tenant). Existing patterns in the same file (refetchDocuments, relationship loading) just call find() directly. Match that idiom and drop ~70 lines.

#4 Mirror: only capture inserted docs in skipDuplicates mode

The captureOnNext accumulator paid the cost (closure + per-doc array push) on every createDocuments call, including the common non-skip path. Branch at the entry of Mirror::createDocuments so the capture only happens when skipDuplicates is set; the non-skip path passes through to source/destination unchanged.

#5 Move getInsertKeyword/Suffix/PermissionsSuffix to getters cluster

They were sitting next to createDocuments(); moved to the getSupport* cluster around line 1030 where other adapter-capability shims live.

Not addressed:
- #2 partial: the existing patterns (refetchDocuments etc.) don't handle tenant-per-document multi-tenant batches either, so this is consistent.
- #3 (drop the pre-filter): rejected. createDocumentRelationships runs in the encoding loop BEFORE the adapter's INSERT IGNORE no-ops the parent, so dropping the pre-filter would deterministically duplicate child rows on every CSV re-import of a collection with relationships (not a race window; it would happen on every call). The relationships test verifies this.
Reverting would require reintroducing the deferred-relationships scaffolding we just removed, and the adapter still couldn't tell us which parents were actually inserted (SQL INSERT IGNORE has no per-row reporting). Pre-filter stays.
Per Jake's #4 review comment, the captureOnNext accumulator could grow unbounded with the full inserted set. Replace it with a bounded buffer that flushes to the destination every $batchSize docs as the source streams them through onNext, plus a tail flush after the source returns.

Peak memory in skipDuplicates mode is now O($batchSize) (~1000 docs) regardless of how many docs the source inserts. Destination batching efficiency is unchanged: the destination still gets multi-doc createDocuments calls, just one per chunk instead of one giant call at the end.

Eligibility checks (SOURCE_ONLY_COLLECTIONS, upgrade status) moved upfront so the flush closure can capture the decision and short-circuit cleanly. The non-skip path is unchanged: direct passthrough, no buffer. The flush-on-fill is load-bearing for the memory bound; see the in-method comment.
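The bounded-buffer forwarding described above has roughly the following shape. This is a sketch: $batchSize, the Document type, and the onNext plumbing are assumptions drawn from the commit text, not the actual Mirror code.

```php
$buffer = [];

// Flush whatever has accumulated to the destination, then reset.
$flush = function () use (&$buffer, $collection): void {
    if ($buffer !== []) {
        $this->destination->createDocuments($collection, $buffer);
        $buffer = [];
    }
};

// Passed to the source as onNext: buffer each inserted doc and flush
// every $batchSize docs, so peak memory stays O($batchSize), not O(N).
$capture = function (Document $doc) use (&$buffer, $flush, $onNext): void {
    $buffer[] = $doc;
    if (\count($buffer) >= $this->batchSize) {
        $flush(); // flush-on-fill: load-bearing for the memory bound
    }
    if ($onNext !== null) {
        $onNext($doc);
    }
};

$count = $this->source->createDocuments($collection, $documents, onNext: $capture);
$flush(); // tail flush after the source returns
```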
…erDocument

When inlining find() per Jake's review, the per-tenant grouping the deleted fetchExistingByIds helper used to do was lost. In tenantPerDocument mode with a multi-tenant batch, the inlined find() runs under the session tenant context (often null for platform workers) and silently misses rows belonging to other tenants: every input doc looks "new" and goes down the create path instead of the update path.

This actively breaks appwrite's StatsUsage worker, which accumulates stats across many projects (each tagged with its own $tenant) and flushes them via $dbForLogs->setTenant(null)->setTenantPerDocument(true) followed by upsertDocumentsWithIncrease(...). With the old per-doc getDocument loop, each lookup ran under withTenant($doc->getTenant()) and was correct. The batch find() inlining lost that scoping.

Fix per Greptile's suggestion: branch on tenantPerDocument mode and run one find() per unique tenant value under withTenant, merging results into the same lookup hash. K queries (K = unique tenants in the batch) instead of N (per-doc) or 1 (broken). The common single-tenant case still hits the fast batched path.

Applied at both call sites:
- Database::createDocuments skipDuplicates pre-filter
- Database::upsertDocumentsWithIncrease existing-docs lookup

Fixes greptile flag at PR #852 discussion r3077481595.
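The K-queries-per-batch lookup the commit describes might look roughly like this. Illustrative only: withTenant() and the composite-key format come from the commit message, everything else (grouping helper, query shapes) is assumed.

```php
$existing = [];

if ($this->adapter->getTenantPerDocument()) {
    // Group input ids by their document-level tenant.
    $idsByTenant = [];
    foreach ($documents as $doc) {
        $idsByTenant[$doc->getTenant() ?? 0][] = $doc->getId();
    }

    // One find() per unique tenant, each scoped via withTenant().
    foreach ($idsByTenant as $tenant => $ids) {
        $found = $this->withTenant($tenant, fn () => $this->find($collection->getId(), [
            Query::equal('$id', $ids),
            Query::limit(\count($ids)),
        ]));
        foreach ($found as $doc) {
            $existing[$tenant . ':' . $doc->getId()] = $doc; // composite key
        }
    }
} else {
    // Common single-tenant case: one batched find().
    $found = $this->find($collection->getId(), [
        Query::equal('$id', \array_map(fn (Document $d) => $d->getId(), $documents)),
        Query::limit(\count($documents)),
    ]);
    foreach ($found as $doc) {
        $existing[$doc->getId()] = $doc;
    }
}
```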
Per Jake's review feedback, the pre-filter in Database::createDocuments was redundant with the adapter's INSERT IGNORE / ON CONFLICT DO NOTHING / upsert + $setOnInsert path. The adapter already handles duplicates atomically; the upfront SELECT duplicated that work.

The semantic is now row-level dedup:
- pre-existing rows are silently no-op'd by the adapter
- new rows go through
- retries are idempotent
- source data is never overwritten

Tests updated to match:
- 5 skipDuplicates tests adjusted to expect full-chunk counts, with data-level assertions unchanged (pre-existing values preserved, new rows inserted).
- testCreateDocumentsSkipDuplicatesRelationships encodes the retry scenario: a missing nested child gets created via relateDocuments' idempotent path when the parent already exists.
- testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination relaxed to source-authoritative semantics only. Destination may transiently hold would-be values during the migration window; backfill reconciles.

All 105 skipDuplicates tests across 11 adapter variants pass, the full Mirror suite (636 tests) passes, and PHPStan level 7 is clean.
Follow-up to Jake's #5 review comment: the fix was only applied to SQL.php. Propagate the same grouping to the other adapters and fix one cosmetic byproduct of the Mirror refactor.

- Postgres.php: move getInsertKeyword/Suffix/PermissionsSuffix from between createDocuments() and getUpsertStatement() down to the getSupportFor* cluster (matches the SQL.php:1036 layout).
- SQLite.php: move the getInsertKeyword override from the top of the class to after the getSupportFor* cluster.
- Adapter.php: the skipDuplicates() method was inserted between protected property declarations, breaking the property cluster. Move it down next to withTransaction(), the other callable-taking scope guard. The $skipDuplicates property stays with its peers.
- Mirror.php: revert the incidental fn () => collapse on the non-skip destination createDocuments call back to the original two-line form, minimizing unrelated diff noise.

Pure relocation / style. No behavior change.
Addresses Greptile #3084293974. The captureOnNext-based forwarding was structurally broken after the pre-filter removal: SQL adapters return the full input batch from createDocuments regardless of how many rows INSERT IGNORE actually inserted, so onNext fired for every document, including skipped duplicates. captureOnNext buffered all of them and flushed the full buffer (carrying the caller's would-be values) to destination, diverging destination from source whenever source had a pre-existing row the caller tried to overwrite.

Fix: instead of trying to distinguish inserted vs. skipped at insert time, re-fetch source's authoritative state via find() after the adapter write settles, then forward that to destination. Race-free regardless of concurrent writers: destination always receives a faithful snapshot of whatever source holds, because the query runs after source's write has resolved.

Cost: one extra SELECT per batch against source when skipDuplicates is active AND Mirror has an upgraded destination. Zero cost on the non-skip path. Memory remains O(batch_size).

Net ~47 lines smaller than the previous capture-based approach:
- captureOnNext closure removed
- flushBuffer closure removed
- bounded-buffer machinery removed
- shouldMirror upfront snapshot removed (inlined as an eligibility check)

Test: testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination now asserts destination holds source's 'Original' value rather than the caller's 'WouldBe' input, per Greptile's suggested assertion.
…p paths

Addresses Greptile #3084293974. The captureOnNext-based forwarding was structurally broken: SQL adapters return the full input batch from createDocuments regardless of what INSERT IGNORE actually inserted, so onNext fired for every doc, including skipped duplicates. captureOnNext then forwarded the caller's would-be values to destination, diverging destination from source whenever source had a pre-existing row the caller tried to re-insert.

Fix: pre-filter against source before the insert to identify which input ids already exist, then forward only the genuinely-new docs to destination. A source-skipped duplicate is a no-op (no user write happened), so nothing should propagate. This matches Greptile's semantics: destination.dup.isEmpty() after skipping a duplicate.

While the logic was touched, the skip and non-skip paths (which shared the eligibility check, upgrade check, clone loop, destination write, and after-filter loop) are unified into a single flow with $this->skipDuplicates branches only at the points where behavior actually differs:
- source call: wrapped in source->skipDuplicates when active
- $toForward: filter out pre-existing ids when active
- destination call: wrapped in destination->skipDuplicates when active

The non-skip hot path pays zero new closure allocations; the pre-filter block is gated behind if ($this->skipDuplicates) and short-circuits to an empty $existingIds map.

Net: -46 lines in createDocuments (~170 → ~95), same test coverage, and Greptile's assertion now holds.
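A condensed sketch of the "forward only genuinely-new docs" filter described above (assumed shapes; the real method also handles cloning, eligibility checks, and the after-filter loop):

```php
$existingIds = [];
if ($this->skipDuplicates) {
    // Ask source which of the input ids already exist.
    $ids = \array_map(fn (Document $d) => $d->getId(), $documents);
    $found = $this->source->silent(fn () => $this->source->find($collection, [
        Query::equal('$id', $ids),
        Query::limit(\count($ids)),
    ]));
    foreach ($found as $doc) {
        $existingIds[$doc->getId()] = true;
    }
}

// A source-skipped duplicate is a no-op (no user write happened),
// so it must not propagate to destination.
$toForward = \array_values(\array_filter(
    $documents,
    fn (Document $d) => !isset($existingIds[$d->getId()])
));
```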
Actionable comments posted: 2
⚠️ Outside diff range comments (1)
src/Database/Mirror.php (1)
604-656: ⚠️ Potential issue | 🟠 Major: Pre-filter lookup is not tenant-aware for tenantPerDocument mode.

When tenantPerDocument is enabled, documents with the same $id but different tenants are distinct. The current implementation:

- calls find() with all IDs at once without grouping by tenant (line 616)
- keys $existingIds by $id alone (line 622)
- filters $toForward by $id alone (line 650)

If the batch contains documents for multiple tenants, this incorrectly excludes new documents whose ID exists for a different tenant, or fails to detect true duplicates for the correct tenant.

Use the same tenant-aware pattern as Database::upsertDocumentsWithIncrease(): when tenantPerDocument is enabled, group documents by tenant and call find() per-tenant using withTenant(), then use composite keys ($tenant:$id) for tracking existing IDs.
♻️ Duplicate comments (1)
src/Database/Database.php (1)
5728-5732: ⚠️ Potential issue | 🔴 Critical: skipDuplicates() still leaves relationship side effects behind.

By the time execution reaches Line 5728, every input document has already gone through createDocumentRelationships(). If the adapter then ignores a duplicate parent row, those child/junction writes are still committed. That can orphan rows or mutate relationships on an existing document even though the parent create was skipped. Relationship writes need to be driven from the successfully inserted $batch, not from the pre-insert input list.
🧹 Nitpick comments (2)
src/Database/Adapter/Postgres.php (1)
2358-2358: Silence the unused override parameter warning.

$table is intentionally unused here. If the parent signature must stay as-is, rename it to $_table or add a local suppression so PHPMD stops flagging this override.

♻️ Small cleanup

```diff
- protected function getInsertSuffix(string $table): string
+ protected function getInsertSuffix(string $_table): string
```

tests/e2e/Adapter/Scopes/DocumentTests.php (1)
7930-7933: Clarify whether these assertions validate "processed" vs. "created" semantics.

At Lines 7930, 7984, 8110, and 8155, the tests assert that counts and onNext emissions include duplicate IDs. That encodes a "processed inputs" contract, not "newly created rows." Please make this explicit (e.g., rename to $processedCount / add a short comment) or add before/after count assertions to separately validate created-row deltas.

Also applies to: 7984-7986, 8110-8112, 8155-8157
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 06d46ba1-a7eb-478b-ad90-a27f372a0327
📒 Files selected for processing (10)
- src/Database/Adapter.php
- src/Database/Adapter/Mongo.php
- src/Database/Adapter/Pool.php
- src/Database/Adapter/Postgres.php
- src/Database/Adapter/SQL.php
- src/Database/Adapter/SQLite.php
- src/Database/Database.php
- src/Database/Mirror.php
- tests/e2e/Adapter/MirrorTest.php
- tests/e2e/Adapter/Scopes/DocumentTests.php
🚧 Files skipped from review as they are similar to previous changes (3)
- src/Database/Adapter.php
- src/Database/Adapter/SQLite.php
- src/Database/Adapter/Mongo.php
Addresses CodeRabbit #3084994641 and #3084994657 (+ Greptile #3084466265).
Three call sites were passing unchunked id arrays into Query::equal('$id',
...), which throws QueryException from DocumentsValidator once the array
exceeds 5000 values.
Mirror::createDocuments (skipDuplicates pre-filter): defensive — Mirror's
public API accepts arbitrary-size input, and while current callers pass
batches below the limit, the code shouldn't throw on a theoretical
direct caller with >5000 input ids.
Database::upsertDocumentsWithIncrease (both tenantPerDocument and single-
tenant branches): real regression. The batched pre-read was introduced
in ae929db to restore per-tenant grouping for StatsUsage. It replaced
a per-doc getDocument() loop on main (which is safe from validator
limits because each call queries one id). Passing the full input array
to Query::equal now breaks at >5000 input docs — a path that previously
worked slowly on main.
All three sites use self::RELATION_QUERY_CHUNK_SIZE (5000) and
Query::limit(PHP_INT_MAX), matching the existing pattern used by the
five populateDocumentsRelationshipsBatch chunking sites. Accumulator
shape is adapted to our map-only use case: we build the existence set /
lookup map directly inside the chunk loop instead of carrying a flat
list we don't need.
♻️ Duplicate comments (2)
src/Database/Database.php (2)
7160-7187: ⚠️ Potential issue | 🟠 Major: Respect maxQueryValues when chunking the pre-read lookups.

These find() calls still validate Query::equal('$id', $chunk) against $this->maxQueryValues. If a caller lowers that limit below 5000, chunking by RELATION_QUERY_CHUNK_SIZE will still fail before the upsert starts.

✂️ Suggested change

```diff
- foreach (\array_chunk($tenantIds, self::RELATION_QUERY_CHUNK_SIZE) as $chunk) {
+ foreach (\array_chunk($tenantIds, \max(1, $this->maxQueryValues)) as $chunk) {
@@
- foreach (\array_chunk($docIds, self::RELATION_QUERY_CHUNK_SIZE) as $chunk) {
+ foreach (\array_chunk($docIds, \max(1, $this->maxQueryValues)) as $chunk) {
```

Based on learnings, both createDocuments() and upsertDocumentsWithIncrease() prefetch existing document IDs in chunks using array_chunk(..., max(1, $this->maxQueryValues)) to satisfy DocumentsValidator limits on Query::equal('$id', ...).
array_chunk(..., max(1, $this->maxQueryValues))to satisfyDocumentsValidatorlimits onQuery::equal('$id', ...).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/Database/Database.php` around lines 7160 - 7187, The pre-read find() calls chunk IDs using self::RELATION_QUERY_CHUNK_SIZE but Query::equal('$id', $chunk) is validated against $this->maxQueryValues, so change the chunking in both loops (the tenant branch that iterates over $idsByTenant and the else branch that iterates over $docIds) to use array_chunk(..., max(1, $this->maxQueryValues)) instead of self::RELATION_QUERY_CHUNK_SIZE so the chunk size never exceeds the DocumentsValidator limit when calling find().
5728-5732: ⚠️ Potential issue | 🔴 Critical: skipDuplicates() still lets skipped rows mutate relationship state.

By the time this guard runs, every input document has already gone through createDocumentRelationships() at Lines 5720-5722. A row ignored here can still create/update children or junction rows, and duplicate child/junction inserts triggered during that earlier relationship pass still go through the normal createDocument() path. The insert result needs to be authoritative for relationship work, or those writes need to be deferred until after you know which rows were actually inserted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c7a16723-e265-42e8-8fd4-3a629176caf0
📒 Files selected for processing (2)
- src/Database/Database.php
- src/Database/Mirror.php
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Database/Mirror.php
…SIZE
CodeRabbit #3084994641 + #3084994657 follow-up: the earlier chunking
fix used self::RELATION_QUERY_CHUNK_SIZE (5000), which matches the
existing populateDocumentsRelationshipsBatch pattern but doesn't
respect the user-configurable validator ceiling.
Query::equal('$id', $chunk) is validated by DocumentsValidator
against $this->maxQueryValues, not RELATION_QUERY_CHUNK_SIZE. A
caller doing $db->setMaxQueryValues(100) followed by any of the
three chunked pre-reads would still throw QueryException, because
chunks of 5000 exceed the configured 100.
Switch to array_chunk(..., max(1, $this->maxQueryValues)) at all
three sites so chunk size always respects the actual validator
limit. In the default case (maxQueryValues = 5000) this produces
identical chunks to the old code; under customization it correctly
scales down.
Note: the 5 existing relationship-population chunking sites still
use RELATION_QUERY_CHUNK_SIZE and have the same latent bug under
customized maxQueryValues. Leaving that for a follow-up since
fixing the pattern globally is out of scope for this PR.
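Per the commit, the resulting pre-read loop is shaped roughly like this (a sketch; the accumulator naming and surrounding query shapes are assumed):

```php
$existing = [];
// Chunk by the user-configurable validator ceiling, never a fixed constant,
// so Query::equal('$id', $chunk) always passes DocumentsValidator.
foreach (\array_chunk($docIds, \max(1, $this->maxQueryValues)) as $chunk) {
    $found = $this->find($collection->getId(), [
        Query::equal('$id', $chunk),
        Query::limit(\PHP_INT_MAX),
    ]);
    foreach ($found as $doc) {
        $existing[$doc->getId()] = $doc; // build the lookup map directly
    }
}
```

With the default maxQueryValues of 5000 this produces the same chunks as the old RELATION_QUERY_CHUNK_SIZE code; under a lowered limit it correctly scales down.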
CSV / JSON / appwrite-to-appwrite imports that re-run on the same batch (e.g. a user re-uploads the same file, or a worker retries a failed chunk) currently throw DuplicateException and abort the whole batch. Wrap the row-buffer flush in the new skipDuplicates() scope guard so duplicate-by-id rows are silently no-op'd at the adapter layer (INSERT IGNORE / ON CONFLICT DO NOTHING / $setOnInsert), letting the rest of the batch proceed.

The existing skipRelationshipsExistCheck() wrapper is preserved (FK-target guard); skipDuplicates composes around it.

Feature-branch note: depends on utopia-php/database's skipDuplicates() scope guard from PR utopia-php/database#852. composer.json is temporarily pinned to dev-csv-import-upsert-v2 with a 5.99.0 alias so composer can resolve the 5.* constraint transitively. It must be reset to the proper release version (^5.X.Y) once PR #852 merges and utopia-php/database ships.
Exposes two new optional boolean params on the three migration
creation endpoints so CSV / JSON / appwrite-to-appwrite imports can
choose how to handle rows whose IDs already exist at the destination.
Endpoints updated (app/controllers/api/migrations.php):
- POST /v1/migrations/appwrite
- POST /v1/migrations/csv/imports
- POST /v1/migrations/json/imports
Parameter semantics:
- overwrite=true -> destination uses upsertDocuments instead of
createDocuments; existing rows are replaced
with imported values
- skip=true -> destination wraps createDocuments in
skipDuplicates; existing rows are preserved
unchanged, duplicate-id rows silently no-op
- both false -> default; fails fast on DuplicateException
(original behavior, unchanged)
- both true -> overwrite wins (upsert subsumes skip)
Both params are stored in the migration Document's options array
(matches the existing pattern for destination behavior config like
path, size, delimiter, bucketId, etc.) and read back in the worker's
processDestination() to instantiate DestinationAppwrite with the
new constructor params.
Feature-branch note: depends on utopia-php/migration#feat/skip-duplicates
(DestinationAppwrite constructor params) which in turn depends on
utopia-php/database#852 (skipDuplicates scope guard). composer.json is
temporarily pinned to dev-feat/skip-duplicates and
dev-csv-import-upsert-v2 respectively; both must be reset to proper
release versions once the upstream PRs merge.
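A sketch of how the worker's processDestination() might map the two options onto write modes, per the semantics listed above (parameter plumbing simplified; DestinationAppwrite internals omitted):

```php
// both true -> overwrite wins (upsert subsumes skip);
// both false -> original fail-fast behavior.
if ($overwrite) {
    // Existing rows are replaced with imported values.
    $database->upsertDocuments($collectionId, $documents);
} elseif ($skip) {
    // Existing rows are preserved; duplicate-id rows silently no-op.
    $database->skipDuplicates(
        fn () => $database->createDocuments($collectionId, $documents)
    );
} else {
    // Default: throws DuplicateException on the first existing id.
    $database->createDocuments($collectionId, $documents);
}
```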
Summary

Adds a `Database::skipDuplicates()` scope guard so callers can wrap a `createDocuments()` call and have duplicate-by-id rows silently skipped instead of throwing `DuplicateException`. Used by CSV import / re-import flows and database migration workers, where re-running the same batch should be idempotent rather than fatal.

The shape follows the existing `silent()` / `skipRelationships()` pattern (an instance-level scope guard, not a flag parameter), so callers compose naturally.

Design
Duplicate handling happens at the adapter layer via dialect-specific no-op inserts. `Database::createDocuments` sets a scope-guarded flag on the adapter before each chunk's INSERT; the adapter emits the right SQL (or Mongo equivalent) and returns the batch. No pre-filter, no intra-batch dedup: adapter-level row dedup is sufficient and avoids redundant queries.

Per-adapter implementation
- MariaDB / MySQL: `INSERT IGNORE INTO`
- Postgres: `INSERT INTO ... ON CONFLICT (...) DO NOTHING`
- SQLite: `INSERT OR IGNORE INTO`
- MongoDB: upsert + `$setOnInsert`, bypassing the transaction wrap to avoid txn-abort-on-duplicate
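The per-dialect statements could be sketched as follows (illustrative only, not the adapters' actual SQL builders; identifier quoting and the `_uid` conflict target are assumptions):

```php
$statement = match ($dialect) {
    'mariadb', 'mysql' => "INSERT IGNORE INTO {$table} ({$columns}) VALUES {$values}",
    'postgres'         => "INSERT INTO {$table} ({$columns}) VALUES {$values} ON CONFLICT (\"_uid\") DO NOTHING",
    'sqlite'           => "INSERT OR IGNORE INTO {$table} ({$columns}) VALUES {$values}",
};
// MongoDB has no single-statement equivalent: it issues upserts whose
// $setOnInsert payload applies only when the id is new, and skips the
// transaction wrap so a duplicate cannot abort the whole transaction.
```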
Hot-path cost

One property read plus one boolean branch per chunk. Callers that don't use `skipDuplicates` pay no closure allocation and no scope-guard setup; the feature cost is paid only when the feature is actually used. This addresses review comment #1 (don't penalize the common case with an `$enable` flag pattern).

Mirror dual-write correctness
`Mirror::createDocuments` pre-filters against source to identify which input ids already exist, then, after the source insert, forwards only the genuinely-new docs to destination. Pre-existing ids were silently no-op'd by the adapter's `INSERT IGNORE`; they represent no user write, so nothing propagates to destination.

This replaces an earlier `captureOnNext`-based approach flagged by Greptile (#3084293974): that approach wrapped `$onNext` to capture "what source inserted", but SQL adapters return the full input batch from `createDocuments` regardless of `rowCount()`, causing `$onNext` to fire for every doc, including skipped duplicates, so Mirror forwarded would-be values to destination, diverging from source. Pre-filter-then-insert matches the idempotent-write semantics: a skipped duplicate is not a user write, so it does not mirror.

Cost: one extra `SELECT` against source per batch when `skipDuplicates` is active and Mirror has an upgraded destination. Zero cost on the non-skip path. Skip and non-skip paths are unified into a single method body, with branches only at the three points where behavior actually differs (source call, forward filter, destination call).

`upsertDocumentsWithIncrease` tenant grouping (regression fix)

Restored per-tenant grouping for the existing-doc lookup in sharedTables + tenantPerDocument mode. An earlier review round (#2, inline `find()` instead of a helper) was correct for single-tenant use cases but ran under the session tenant context (`null` for platform workers), silently missing rows from other tenants. This broke Appwrite's StatsUsage worker, which accumulates stats across many projects and flushes via `$dbForLogs->setTenant(null)->setTenantPerDocument(true)` + `upsertDocumentsWithIncrease()`: every input doc looked "new" and went down the create path instead of the update path.

The fix: branch on `tenantPerDocument` mode and run one `find()` per unique tenant, wrapped in `withTenant($tenant, ...)`. K queries total, where K is the number of unique tenants in the batch. Single-tenant callers (the common case, including CSV import and database migration) still hit the fast one-query batched path.

Helper: `tenantKey(Document)`, used only within `upsertDocumentsWithIncrease` to keep the in-memory lookup map and the intra-batch dedup check consistent. Returns `"<tenant>:<id>"` in `tenantPerDocument` mode, just `<id>` otherwise. Without it, a legitimate cross-tenant batch like `[{id: x, tenant: 1}, {id: x, tenant: 2}]` would falsely fail with `DuplicateException`.
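A standalone sketch of the helper's described behavior (signature simplified to an array; per the text above, the real helper takes a Document):

```php
// Hypothetical sketch of tenantKey(): composite key only in
// tenantPerDocument mode, so cross-tenant batches sharing an id
// don't collide in the in-memory dedup map.
function tenantKey(array $doc, bool $tenantPerDocument): string
{
    return $tenantPerDocument
        ? ($doc['$tenant'] ?? '') . ':' . $doc['$id']
        : $doc['$id'];
}

// [{id: x, tenant: 1}, {id: x, tenant: 2}] now yields two distinct keys.
```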
Tests

End-to-end coverage across all adapter variants (MariaDB / MySQL / Postgres / SQLite / MongoDB × Pool / Mirror / SharedTables):

- `testCreateDocumentsIgnoreDuplicates`: mix of existing and new docs; the non-existing are inserted, the existing no-op'd
- `testCreateDocumentsIgnoreAllDuplicates`: every doc already exists, nothing newly inserted
- `testCreateDocumentsSkipDuplicatesEmptyBatch`: empty batch is a no-op
- `testCreateDocumentsSkipDuplicatesNestedScope`: scope guard saves and restores through nested calls
- `testCreateDocumentsSkipDuplicatesLargeBatch`: 300 docs with 50 pre-existing, exercises the chunk loop
- `testCreateDocumentsSkipDuplicatesSecondCallSkipsAll`: idempotent re-invocation preserves first-write values
- `testCreateDocumentsSkipDuplicatesRelationships`: relationship children aren't duplicated on re-run
- `testCreateDocumentsSkipDuplicatesDoesNotDivergeDestination` (Mirror): source-seeded `dup=Original` with a `dup=WouldBe` caller input; asserts `destination.dup` stays empty and `destination.fresh` gets the newly-inserted row (Greptile's exact assertion from #3084293974)

Test plan

- `composer analyze` (PHPStan) clean at the project's configured level
- `testCreateDocuments*` and `testUpsert*` suites