Fixed smart count false positive on subquery JOINs#534
Fixed smart count false positive on subquery JOINs#534kevinansfield wants to merge 3 commits intomainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaces SQL-text/reg-ex parsing for multi-table detection with inspection of Knex query-builder internals. hasMultiTableSource(queryBuilder) now checks Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Detailed analysis
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #534 +/- ##
==========================================
+ Coverage 99.23% 99.37% +0.14%
==========================================
Files 127 127
Lines 6404 6406 +2
Branches 1227 1231 +4
==========================================
+ Hits 6355 6366 +11
+ Misses 49 40 -9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
packages/bookshelf-pagination/test/pagination.test.js (1)
364-381: Clarify why CTE-only queries don't require distinct count.The test correctly expects
count(*)for a CTE-only query, but the reasoning may not be obvious to future maintainers. CTEs (WITHclauses) define named subqueries but don't inherently duplicate rows in the outer query—the outer query still selects from a single table.Consider adding a brief comment explaining this behavior, similar to other tests.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/bookshelf-pagination/test/pagination.test.js` around lines 364 - 381, Add a brief inline comment in the test "useSmartCount uses count(*) when AST has only a CTE (with) grouping" explaining that a CTE (WITH clause) defines a named subquery but does not duplicate rows in the outer query, so the outer query still selects from a single table and a simple count(*) is sufficient; place this comment near the model.query override or right before the fetchPage call (references: model.query override, fetchPage({..., useSmartCount: true}), and the final assertion modelState.rawCalls[0] === 'count(*) as aggregate') so future maintainers understand why distinct counting isn't needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/bookshelf-pagination/test/pagination.test.js`:
- Around line 364-381: Add a brief inline comment in the test "useSmartCount
uses count(*) when AST has only a CTE (with) grouping" explaining that a CTE
(WITH clause) defines a named subquery but does not duplicate rows in the outer
query, so the outer query still selects from a single table and a simple
count(*) is sufficient; place this comment near the model.query override or
right before the fetchPage call (references: model.query override,
fetchPage({..., useSmartCount: true}), and the final assertion
modelState.rawCalls[0] === 'count(*) as aggregate') so future maintainers
understand why distinct counting isn't needed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0dd597c9-0443-4307-8bb3-7b8e7020ac53
📒 Files selected for processing (2)
packages/bookshelf-pagination/lib/bookshelf-pagination.jspackages/bookshelf-pagination/test/pagination.test.js
hasMultiTableSource ran a JOIN regex against the full compiled SQL, so any query with `where id in (select ... inner join ...)` fell back to `count(distinct posts.id)` even though the outer query was still single table. On a 267k-post Ghost site filtered by author this caused a ~20% regression in the count aggregate for no benefit — the weedout step already guaranteed distinct rows. Switched the detection to walk knex's query-builder AST (`_statements` and `_single.table`) so joins, unions, derived/raw FROM sources and comma-separated FROM lists are read as structured data. Subquery JOINs live under a `where` grouping and are correctly ignored. Since bookshelf is built on knex, the builder always exposes `_statements` / `_single`, so no SQL fallback is needed: the whole check collapses to a ~30-line AST walk with no regexes, no toSQL() call and no paren-stripping.
47695a6 to
8179605
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
packages/bookshelf-pagination/test/pagination.test.js (1)
80-97: Add one real query-builder smoke test alongside this stub helper.These tests only emulate
_statements/_singlewith plain objects, so they will not catch a Knex/Bookshelf upgrade that changes the actual private builder shapehasMultiTableSource()depends on. One smoke test built from a real builder would make this suite much less brittle.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/bookshelf-pagination/test/pagination.test.js` around lines 80 - 97, Add a real query-builder smoke test (in addition to tests using stubCountQuery) that constructs an actual Knex/Bookshelf query with a multi-table source (e.g., call model.query().from('tableA').join('tableB', ...) or otherwise produce multiple sources) and then invoke the function under test (the helper that checks multi-table sources, e.g., hasMultiTableSource or the pagination entrypoint that forks count query) to assert it returns the expected boolean/behavior; do not modify stubCountQuery—instead write a separate test that does not replace model.query(), uses the real builder returned by model.query(), builds a multi-table query, and asserts the correct detection of multi-table sources so upgrades to Knex/Bookshelf will be caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/bookshelf-pagination/lib/bookshelf-pagination.js`:
- Around line 38-47: The code detects non-string/derived FROM sources via
queryBuilder._single.table (variable table) but later still emits count(distinct
${tableName}.${idAttribute}) in the useSmartCount branch, which produces invalid
SQL for derived tables; modify the useSmartCount logic so that when
queryBuilder._single.table is not a plain string (or contains commas) it does
not choose the distinct-by-table form—i.e., treat that case as multi-source and
instead fall back to the safe simple count path (disable useSmartCount or avoid
referencing tableName/idAttribute) so count SQL never references out-of-scope
columns for QueryBuilder/Raw FROMs.
- Around line 25-43: The hasMultiTableSource function assumes internals exist
and directly dereferences queryBuilder._statements and
queryBuilder._single.table which can throw; guard these accesses by first
checking that queryBuilder and queryBuilder._statements are arrays before
iterating and that queryBuilder._single exists (and has a table property) before
reading table, falling back to the safe multi-source result (i.e., treat as
multi-table) when those internals are missing or of unexpected types so the code
will then use the count(distinct ...) path instead; update checks around the
symbols queryBuilder._statements and queryBuilder._single.table in
hasMultiTableSource accordingly.
---
Nitpick comments:
In `@packages/bookshelf-pagination/test/pagination.test.js`:
- Around line 80-97: Add a real query-builder smoke test (in addition to tests
using stubCountQuery) that constructs an actual Knex/Bookshelf query with a
multi-table source (e.g., call model.query().from('tableA').join('tableB', ...)
or otherwise produce multiple sources) and then invoke the function under test
(the helper that checks multi-table sources, e.g., hasMultiTableSource or the
pagination entrypoint that forks count query) to assert it returns the expected
boolean/behavior; do not modify stubCountQuery—instead write a separate test
that does not replace model.query(), uses the real builder returned by
model.query(), builds a multi-table query, and asserts the correct detection of
multi-table sources so upgrades to Knex/Bookshelf will be caught.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 19cefd58-febf-4215-972c-39cddbe1a1ee
📒 Files selected for processing (2)
packages/bookshelf-pagination/lib/bookshelf-pagination.jspackages/bookshelf-pagination/test/pagination.test.js
There was a problem hiding this comment.
♻️ Duplicate comments (2)
packages/bookshelf-pagination/lib/bookshelf-pagination.js (2)
25-43:⚠️ Potential issue | 🟠 MajorGuard private Knex fields before dereferencing them.
Line 26 and Line 42 assume
_statements/_singlealways exist. Ifmodel.query()returns a wrapped builder, a partial test stub, or a future Knex version reshapes these internals,fetchPage()now throws synchronously before the existing promise error handling runs. Default this helper to the conservative path instead of iterating private fields unconditionally.🛡️ Minimal hardening
function hasMultiTableSource(queryBuilder) { + if (!queryBuilder || !Array.isArray(queryBuilder._statements) || !queryBuilder._single) { + return true; + } + for (const statement of queryBuilder._statements) { // Any outer join duplicates the base row set. if (statement.grouping === 'join') { return true; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/bookshelf-pagination/lib/bookshelf-pagination.js` around lines 25 - 43, The helper hasMultiTableSource currently dereferences private fields queryBuilder._statements and queryBuilder._single without checks; modify hasMultiTableSource to first verify that queryBuilder._statements is an array before looping (if not, treat as multi-source and return true), and similarly check that queryBuilder._single exists and that queryBuilder._single.table is present and a string before using it (otherwise return true). Update the logic in hasMultiTableSource to take the conservative path (return true) whenever those private internals are missing or of unexpected types so we avoid synchronous crashes.
38-47:⚠️ Potential issue | 🟠 MajorDerived
FROMsources need a different count strategy than joins/unions.After Line 43 returns
true, Line 237 still falls back tocount(distinct ${tableName}.${idAttribute}). That works for normal joins, butFROM (subquery) AS aliasusually puts${tableName}out of scope, souseSmartCountcan turn a valid query into a SQL error/BadRequestError. This helper needs to distinguish “count(*)is unsafe” from “the base table is still addressable” instead of collapsing both into one boolean.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/bookshelf-pagination/lib/bookshelf-pagination.js` around lines 38 - 47, The current helper sets a single boolean when the FROM is non-standard (queryBuilder._single.table) which later causes useSmartCount logic to emit count(distinct ${tableName}.${idAttribute}) even for derived FROMs (FROM (subquery) AS alias) where ${tableName} is out of scope; change the helper to return a richer signal (e.g., { multiSource: true, derivedFrom: true } or two booleans) or add a separate function/isDerivedFrom flag that specifically detects a derived FROM (table is non-string or a Raw that wraps a subquery), then update the downstream logic in useSmartCount (and the code path that builds count(distinct ${tableName}.${idAttribute})) to check the derivedFrom flag and avoid emitting table-scoped counts for derived FROMs (use count(*) or wrap the query instead) while still allowing distinct id counts for multi-source/joins when derivedFrom is false.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@packages/bookshelf-pagination/lib/bookshelf-pagination.js`:
- Around line 25-43: The helper hasMultiTableSource currently dereferences
private fields queryBuilder._statements and queryBuilder._single without checks;
modify hasMultiTableSource to first verify that queryBuilder._statements is an
array before looping (if not, treat as multi-source and return true), and
similarly check that queryBuilder._single exists and that
queryBuilder._single.table is present and a string before using it (otherwise
return true). Update the logic in hasMultiTableSource to take the conservative
path (return true) whenever those private internals are missing or of unexpected
types so we avoid synchronous crashes.
- Around line 38-47: The current helper sets a single boolean when the FROM is
non-standard (queryBuilder._single.table) which later causes useSmartCount logic
to emit count(distinct ${tableName}.${idAttribute}) even for derived FROMs (FROM
(subquery) AS alias) where ${tableName} is out of scope; change the helper to
return a richer signal (e.g., { multiSource: true, derivedFrom: true } or two
booleans) or add a separate function/isDerivedFrom flag that specifically
detects a derived FROM (table is non-string or a Raw that wraps a subquery),
then update the downstream logic in useSmartCount (and the code path that builds
count(distinct ${tableName}.${idAttribute})) to check the derivedFrom flag and
avoid emitting table-scoped counts for derived FROMs (use count(*) or wrap the
query instead) while still allowing distinct id counts for multi-source/joins
when derivedFrom is false.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 82cc28a9-a75e-4ea2-b0f7-bbfc65ee0fc3
📒 Files selected for processing (2)
packages/bookshelf-pagination/lib/bookshelf-pagination.jspackages/bookshelf-pagination/test/pagination.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/bookshelf-pagination/test/pagination.test.js
The stub-based smart-count tests were testing our assumptions about knex's internal AST (`_statements` / `_single.table`), not knex itself — if knex shipped a version that reorganised those fields, the tests would keep passing and production would break. This replaces the 7 AST-stub tests with two layers of real-knex coverage: 1. Direct `hasMultiTableSource` tests against real knex QueryBuilders (no DB connection) — cover every branch including edge cases like UNION and a comma-containing `_single.table` string that can't be exercised cleanly end-to-end. 2. End-to-end `fetchPage` tests against a real bookshelf model backed by an in-memory sqlite database. Each test asserts both the chosen count aggregate (via knex's `query` event) and that the resulting total matches what the fetch actually returns — so "we picked count(*)" and "count(*) gives the right answer on this shape" are verified together. `hasMultiTableSource` is now exposed on `paginationUtils` so the direct tests can call it; it's used the same way `parseOptions` and `formatResponse` already are. Adds `bookshelf@1.2.0`, `knex@3.2.9` (matches the monorepo) and `sqlite3@5.1.7` as devDependencies. 31 tests, 100% line/branch/function coverage on `lib/bookshelf-pagination.js`.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
packages/bookshelf-pagination/lib/bookshelf-pagination.js (2)
38-47:⚠️ Potential issue | 🟠 MajorDerived/raw
FROMsources still assume the outer source is namedposts.Marking every non-string
tableas multi-source sends these queries down thecount(distinct posts.id)path, but that identifier is only valid when the derived/raw source exposes the model table name. The new integration case uses.as('posts'), so it doesn't cover.as('sub')or raw aliases whereposts.idis out of scope.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/bookshelf-pagination/lib/bookshelf-pagination.js` around lines 38 - 47, The code currently treats any non-string queryBuilder._single.table as multi-source and later emits count(distinct posts.id) which wrongly hardcodes the outer source name; instead, when table is a derived/raw object check whether it exposes an alias matching the model's table name (use the model's tableName, e.g. modelClass.prototype.tableName or queryBuilder._single.model && queryBuilder._single.model.prototype.tableName) and only treat it as multi-source if no such alias exists — otherwise use count(distinct <model.tableName>.id) dynamically; update the logic around queryBuilder._single.table and the count-generation to reference the model's actual tableName rather than the literal "posts".
25-47:⚠️ Potential issue | 🟠 MajorHarden
hasMultiTableSource()against non-Knex queryBuilder objects.
hasMultiTableSource()unconditionally dereferences_statementsand_single.tablebefore theuseBasicCountbranch is evaluated. Callers who explicitly setuseBasicCount: trueto skip smart-count logic will still fail if the queryBuilder doesn't expose these Knex internals (e.g., stubs, mocks, or custom query objects).Use optional chaining and defensive checks. If either
_statementsor_single.tablecannot be safely accessed, returntrueto treat the query as multi-source, which is the safer default:Suggested fix
function hasMultiTableSource(queryBuilder) { - for (const statement of queryBuilder._statements) { + const statements = Array.isArray(queryBuilder?._statements) ? queryBuilder._statements : null; + if (!statements) { + return true; + } + + for (const statement of statements) { // Any outer join duplicates the base row set. if (statement.grouping === 'join') { return true; } // UNION combines multiple SELECTs — the outer row set comes @@ - const table = queryBuilder._single.table; + const table = queryBuilder?._single?.table; + if (typeof table === 'undefined') { + return true; + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/bookshelf-pagination/lib/bookshelf-pagination.js` around lines 25 - 47, The hasMultiTableSource function dereferences queryBuilder._statements and queryBuilder._single.table without guarding against non-Knex objects; modify hasMultiTableSource to first check that queryBuilder._statements is an array before iterating (if not, return true) and that queryBuilder._single exists and queryBuilder._single.table is safely readable (if not, return true), and use optional chaining (or explicit checks) so callers who set useBasicCount: true but pass mocks/stubs won't throw — keep the original join/union and "table includes comma" logic but only run them after these defensive checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@packages/bookshelf-pagination/lib/bookshelf-pagination.js`:
- Around line 38-47: The code currently treats any non-string
queryBuilder._single.table as multi-source and later emits count(distinct
posts.id) which wrongly hardcodes the outer source name; instead, when table is
a derived/raw object check whether it exposes an alias matching the model's
table name (use the model's tableName, e.g. modelClass.prototype.tableName or
queryBuilder._single.model && queryBuilder._single.model.prototype.tableName)
and only treat it as multi-source if no such alias exists — otherwise use
count(distinct <model.tableName>.id) dynamically; update the logic around
queryBuilder._single.table and the count-generation to reference the model's
actual tableName rather than the literal "posts".
- Around line 25-47: The hasMultiTableSource function dereferences
queryBuilder._statements and queryBuilder._single.table without guarding against
non-Knex objects; modify hasMultiTableSource to first check that
queryBuilder._statements is an array before iterating (if not, return true) and
that queryBuilder._single exists and queryBuilder._single.table is safely
readable (if not, return true), and use optional chaining (or explicit checks)
so callers who set useBasicCount: true but pass mocks/stubs won't throw — keep
the original join/union and "table includes comma" logic but only run them after
these defensive checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 19b57a72-92b3-4597-9a5a-db4528e077e4
⛔ Files ignored due to path filters (1)
yarn.lockis excluded by!**/yarn.lock,!**/*.lock
📒 Files selected for processing (4)
packages/bookshelf-pagination/lib/bookshelf-pagination.jspackages/bookshelf-pagination/package.jsonpackages/bookshelf-pagination/test/pagination-integration.test.jspackages/bookshelf-pagination/test/pagination.test.js
✅ Files skipped from review due to trivial changes (1)
- packages/bookshelf-pagination/package.json
The direct hasMultiTableSource describe block already exercises every query shape against real knex — the e2e suite was 1-for-1 duplication for six of its eight tests. Kept only the two that carry unique weight: 1. An inner join that duplicates rows (p1 × two tags → 3 physical rows for 2 published posts). This is the only shape where count(*) and count(distinct) actually produce different totals, so it's the only e2e test that would catch a "we picked the wrong aggregate" bug at the result level. 2. A subquery JOIN in WHERE — the named regression this PR is about, kept as an explicit end-to-end guard. Dropped the plain-filter, joinRaw, derived-table, CTE, useBasicCount and default-distinct cases from the e2e suite; all are already covered more thoroughly by the direct hasMultiTableSource tests above.
Summary
hasMultiTableSourcematched/\bjoin\b/iagainst the fully compiled SQL, so awhere id in (select … inner join …)forcedcount(distinct posts.id)even when the outer query was still single-table. On a 267k-post Ghost site filtered by author this was a measured ~20% regression on the count aggregate (316ms vs 257ms) for zero correctness benefit — the weedout step already guaranteed distinct rows._statements+_single.table) instead of regex-matching compiled SQL. Joins, unions, derived tables and comma-separated FROM lists are all read as structured data, and JOINs nested inWHERE id IN (…)subqueries live under awheregrouping and are correctly ignored._statements/_single— no SQL fallback needed. The whole check collapses to a ~30-line AST walk: no regexes, notoSQL()call, no paren-stripping, no derived helpers.Tests
Added
bookshelf@1.2.0,knex@3.2.9andsqlite3@5.1.7as devDependencies so the plugin can be tested against real knex/bookshelf instances instead of hand-rolled stubs. 25 tests total, 100% line/branch/function coverage onlib/bookshelf-pagination.js.test/pagination.test.js) — pagination metadata math,limit='all',transactingpassthrough, error wrapping (NotFound / BadRequest / unknown), and theparseOptions/formatResponseutilities. None of these need a real DB.hasMultiTableSourcetests (test/pagination-integration.test.js) — call the function against real knex QueryBuilders (no DB connection) and cover every branch: plain single-table, subquery-in-WHERE, innerJoin, leftJoin/joinRaw, UNION, derived table in FROM,fromRawwith multiple tables, comma-string table, and CTE-only. These pin the plugin's assumptions about knex's AST shape to the version of knex actually installed, and cover edge cases that can't be driven throughfetchPageend-to-end (UNION + aggregate is semantically weird; a comma-string table fails at sqlite execution).fetchPagetests — the only two scenarios that carry unique weight on top of the direct suite:count(*)andcount(distinct)actually produce different totals, so it's the only test that catches a "we picked the wrong aggregate" bug at the result level.fetchPage.Each e2e test listens on knex's
queryevent to assert which count aggregate the plugin chose, and assertspagination.totalagainst a value that's only correct if that aggregate was picked — so "we picked count(*)" and "the result is right" are verified together.Notes on upstream consumers
This is strictly safe for anyone already using
useSmartCount: true: every case the old code flagged as multi-table is still flagged by the AST walk. The change only removes false positives (subquery JOINs) and adds one previously-missed case (derived tables in FROM).useBasicCountand the default distinct-count behaviour are untouched.