Fused fast-paths: address external review correctness findings by singaraiona · Pull Request #198 · RayforceDB/rayforce

singaraiona · 2026-05-07T11:31:04Z

Summary

Branched from master to address the findings in an external code review of the fused-paths work (fp_eval_cmp / fused_group / fused_topk) — surfacing predicate-constant truncation, signed-narrow-agg reads, top-K alias drop, top-K nullmap drop, WHERE-AND reorder past fault-able conjuncts, a SYM-LIKE seen[] data race, mismatch between planner and executor support, and missing nullable / temporal gating. All findings are addressed; the original ClickBench-driven performance work stays on master unchanged.

Per-finding mapping

Review finding	Commit	Fix
#3 Constant truncation (Critical)	`93206a39`	`fp_compile_cmp` range-checks `cval` against the column's representable range and pre-folds out-of-range to all-true / all-false via new `fp_fold_t`.
#4 Signed narrow agg reads (High)	`93206a39`	New `read_signed_by_esz` + per-agg `in_unsigned` flag. SUM/MIN/MAX/AVG over I16/I32/DATE/TIME/TIMESTAMP sign-extend correctly.
#1 Top-K alias drop (High)	`93206a39`, `152f3917`	Round-1 gated off renames; round-3 plumbs `out_alias_syms[]` through to `ray_fused_topk_select` so the materialiser publishes under the alias.
#2 Top-K null materialise (High)	`93206a39`, `152f3917`	Round-1 gated off nullable cols; round-3 propagates nullmaps via `ray_vec_set_null` per gathered row.
#6 WHERE-AND reorder (Medium)	`93206a39`	`reorder_safe` flag — fault-able / side-effecting ops keep user-given order; cheap-first sort only when all conjuncts are pure comparison/logic.
#7 SYM-LIKE `seen[]` race (Medium)	`93206a39`	`__atomic_store_n(..., __ATOMIC_RELAXED)` — same x86 codegen, standard-defined.
Adjacent	`93206a39`	CSV step 9c strips `RAY_ATTR_HAS_NULLS` when post-remap nullmap is empty.
Nullable cols in fused predicate / key / agg-input (round-2 followup)	`e2b863a3`	New `fp_col_supported()` + same gate in `mk_compile`; query.c planner mirrors.
Planner ↔ executor support divergence (round-2 followup)	`e2b863a3`	Shared `fp_atom_col_compatible()` helper used by both `fp_check_simple_cmp` (planner) and `fp_compile_cmp` (executor).
Mixed temporal types (round-2 followup)	`e2b863a3`	DATE/TIME/TIMESTAMP cols require same-typed atom RHS; cross-temporal compares fall back to OP_GROUP.
#9 Top-K tie ordering (Medium)	`950b0fed`	`fpk_cmp` breaks all-keys-tied ties by source row index ascending — deterministic stable-sort prefix semantics.
#8 Runtime fallback (Medium)	`950b0fed`	`exec_filtered_group` wraps the fused dispatch with a graceful unfused FILTER + GROUP fallback when the executor returns `nyi`.
#5 CSV empty-SYM policy (Medium / High)	`950b0fed`	Documented inline; opt-out via `RAYFORCE_CSV_EMPTY_SYM_NULL=1` to retain pre-2026-05 nullable-empty semantics.
#4 (recommendation) Centralise thresholds	`b5af5ada`	`FPK_MAX_K`, `FP_COMBINE_PAR_MIN`, `LIKE_PAR_MIN_ROWS_STR`, `LIKE_PAR_MIN_ROWS_SYM` named with rationale.
#2 (recommendation) Adversarial tests	`04b3a2ae`, `152f3917`	I16/I32 boundary constants, full-range SUM/MIN/MAX, multi-key composite with negative I16, top-K with negative sort key, top-K nullmap propagation, top-K alias rename.
#10 Test coverage	every commit	`test/rfl/integration/fused_group_parity.rfl` + 3 regression unit tests (`eq_const_out_of_range_u8`, `ne_const_out_of_range_u8`, `sum_negative_i16`).

Still flagged (deliberate, with TODOs)

Top-K sort-key nullable — fpk_cmp still rejects nullable sort keys. Adding null-policy (NULLS FIRST / NULLS LAST) needs a planner-level surface; the TODO in fpk_cmp documents what's needed.
Multi-process fused/unfused parity harness — the deterministic in-process parity test catches the surfaced regressions, but a forked-process harness running each query under RAYFORCE_DISABLE_FUSED_* would be stronger coverage.

Test plan

make release builds clean with -Werror.
make test — 2343/2345 pass, 1 skipped, 1 pre-existing failure (sym/like_fn/sym_vec_null_sym, also fails on master).
test/rfl/integration/fused_group_parity.rfl — every fused / unfused divergence the review surfaced has a deterministic regression case here.
make ClickBench bench harness — cluster ratios at parity with master (no perf regression from the gates / fallback / threshold renaming).

The fused predicate / agg / top-K paths landed with several fast-path-only assumptions that broke for non-benchmark-shaped inputs. This commit closes the gaps an external review flagged: * fp_eval_cmp constant truncation fp_compile_cmp now range-checks the decoded `cval` against the column's representable range and pre-folds the comparison to all- true / all-false when the constant lies outside it. Previously `u8_col == 300` was silently rewritten to `u8_col == 44` (300 cast to uint8_t) and matched whatever rows held value 44. fp_eval_cmp now short-circuits via the new fp_fold_t enum before the per-row width-cast compare runs. * signed narrow aggregates read as unsigned Added read_signed_by_esz and a per-agg in_unsigned flag. SUM / MIN / MAX / AVG over an I16 / I32 / DATE / TIME / TIMESTAMP column now sign-extend correctly, so a stored -1 reads as -1 and not 65535. read_by_esz remains in place for SYM dictionary IDs and composite-key packing where the bit pattern is what matters. * fused top-K alias drop The fused materialiser names output columns after the source symbol — schemas with renames like `{alias: source_col}` would silently emit a column named `source_col`. Tighten the planner gate to require alias == source until the fused exec accepts a separate aliases array. * fused top-K nullable columns The fused materialiser does not propagate source nullmaps and the heap comparator does not implement null ordering. Gate the fused path off when any sort key or output column carries RAY_ATTR_HAS_NULLS. * WHERE-AND reorder past fault-able conjuncts The cost-based reorder previously reshuffled any conjuncts that compiled to a vector. An expression like `(and (!= y 0) (< (/ x y) 10))` could be reordered, exposing the divide-by-zero. Track a reorder_safe flag — when any conjunct contains an op outside the comparison/logic vocabulary (OP_DIV/OP_MOD/user-call/etc.) we still chain the conjuncts as separate filters but evaluate them in user-given order. * SYM-LIKE seen[] data race The seen-mark workers wrote a plain non-atomic 1 to a shared byte; even with idempotent value, that's UB in C/C++. Switch to __atomic_store_n with __ATOMIC_RELAXED — same x86 codegen as a plain byte store but standard-defined. * CSV nullmap strip after empty-SYM remap csv_remap_empty_sym_to_id clears null bits for empty SYM rows but did not flip col_had_null, so RAY_ATTR_HAS_NULLS stuck on columns whose nullmap was actually all-zero. Step 9c now also walks the post-remap nullmap and strips the attribute when no bits remain set. This unblocks the new fused-top-K nullable gate for ClickBench-shaped SYM columns that were never really nullable. Regression coverage in test_fused_group: - eq_const_out_of_range_u8: u8 == 300 must match 0 rows - ne_const_out_of_range_u8: u8 != 300 must match every row - sum_negative_i16: SUM(-1, -2, 3, 4) must equal 4 (not 131076)

Round 1 left the deeper structural concerns from the review intact: nullable predicate / key / agg-input columns were still accepted, the planner-side support check diverged from the executor compile, and mixed temporal types could pass through. This commit closes them. Nullable column gates (was: only top-K sort/output checked): - fp_compile_cmp now rejects predicate columns with RAY_ATTR_HAS_NULLS via a new fp_col_supported() helper — the fused per-row evaluator reads raw payload, so a sentinel slot would diverge from the unfused null-aware kernel. - mk_compile rejects HAS_NULLS on every group key column (the composite key compose can't distinguish a null sentinel from a legitimate value with the same bit pattern) and on every aggregate input column (raw reads would corrupt SUM/MIN/MAX/AVG). - The query.c planner gate mirrors both checks so the planner never enables fused for a shape the executor would reject. Planner / executor symmetry (was: planner accepted any atom RHS): - New helper fp_atom_col_compatible(atom_type, col_type) shared by fp_check_simple_cmp (planner) and fp_compile_cmp (executor). Gates that previously diverged now check identical conditions: the planner stops handing nyi shapes to the executor at runtime. Mixed temporal rejection (was: DATE/TIME/TIMESTAMP all interchangeable): - fp_atom_col_compatible requires DATE/TIME/TIMESTAMP cols to have a same-typed atom on the RHS. The fused compare reads raw stored units (days vs microseconds vs nanoseconds), so a cross-temporal constant would have produced meaningless inequalities. Bail to OP_GROUP, which normalises temporals before the compare. Coverage (test/rfl/integration/fused_group_parity.rfl): - U8 column with the value 44 + predicate `u8 == 300` ⇒ 0 rows (would match 44 if the truncation fold were missing). - I16 column with negative values: SUM = 5 (-1+-2+3+4+-5+6), MIN = -5, MAX = 6 (would diverge if the signed read were off). - Tautology / contradiction folds on out-of-range U8 constants. - Chained AND with cheap+expensive conjuncts agrees regardless of reorder choice. Tests: 2342 / 2344 passing (1 skipped, 1 pre-existing failure). Bench cluster ratios at parity with the prior wide-key state.

Round 2 closed structural gates; round 3 addresses remaining concerns the review explicitly flagged. Top-K tie ordering (#9): fpk_cmp now breaks all-keys-tied ties by source row index. Previously the comparator returned 0 on tie, so worker-pool scheduling non-determinism could cause the K-best heap to retain different rows across runs of the same query against the same data. The new tie-break is direction-independent: for both ASC and DESC top-K we want the K rows with the smallest source indices to survive — that matches a stable sort's prefix. Runtime fallback symmetry (#8): exec_filtered_group now wraps the count1 / multi dispatch with a graceful unfused fallback. When fp_compile_pred or mk_compile rejects the shape (returning a 'nyi' error), we build a fresh FILTER + GROUP subgraph from the same predicate / keys / agg-input ops on the OP_FILTERED_GROUP node and execute it via ray_execute. Defence in depth — the round-2 planner gate tightening should prevent exec from seeing unsupported shapes, but if a future change introduces a divergence we degrade to a slower-but-correct result instead of surfacing a user-visible error. CSV empty-SYM policy documentation + opt-out (#5): The empty-CSV-field → empty-symbol remap is now an explicit, documented policy. Default behaviour is unchanged (matches DuckDB / Spark / polars). Setting RAYFORCE_CSV_EMPTY_SYM_NULL=1 reverts to the pre-2026-05 semantics where empty fields stay nullable. The block comment around csv.c:599 spells out the trade-off and notes that the SYM type can't carry a separate empty/null distinction on the value side. Extended parity tests (#10): test/rfl/integration/fused_group_parity.rfl now also covers: - Nullable predicate columns (gate routes to unfused, result matches null-aware semantics). - Nullable agg-input column (sum is null-aware). - Nullable group key column (nulls bucket separately). - DATE column compared to a same-typed atom (works); the gate prevents mixed temporal compares from firing fused. - Top-K aliased projection (alias preserved in output schema). - Top-K tie ordering (deterministic source-order prefix). Tests: 2343 / 2345 passing (1 skipped, 1 pre-existing failure). Bench cluster ratios at parity with the prior round-2 state — no performance regression from the additional gates and fallback.

The review flagged a number of bare numeric constants threaded through the fused fast paths. Move them behind named macros with inline rationale so future tuning has a single source of truth and new readers can see the why instead of just the value. FPK_MAX_K (8192) Top-K bound — couples to a stack-resident global_idx[FPK_MAX_K] in fused_topk_select. Larger K falls through to unfused FILTER + SORT + TAKE which has no buffer constraint. Used by both query.c gates (planner) and fused_topk.c (executor). FP_COMBINE_PAR_MIN (50000) Crossover row count below which the parallel 3-pass radix combine in fused_group loses to the serial walk because of fixed dispatch + scratch alloc cost. LIKE_PAR_MIN_ROWS_STR (200000) LIKE_PAR_MIN_ROWS_SYM (100000) Parallelism crossover for OP_LIKE. STR scans match per-row; SYM scans match per-dict-entry, so SYM parallelises well at lower row counts. FP_SHARD_INIT_CAP / FP_SHARD_MAX_CAP keep their values but pick up a comment block explaining what the bound protects (per-shard memory budget) and what happens at the boundary (OOM → unfused fallback). No behaviour change — every replaced expression resolves to the same integer at the same point in the code. Tests: 2343 / 2345 (1 skipped, 1 pre-existing failure). Bench cluster ratios at parity with the prior round-3 state.

Extend the parity test with width-boundary and multi-key shapes that the review's recommendation #2 ("adversarial random-data tests") explicitly called out: - I16 boundary constants on both ends (INT16_MIN-1, INT16_MAX+1) plus the actual MIN/MAX values living in the column. All folds go to all-true / all-false consistent with arithmetic semantics. - I16 SUM/MIN/MAX with the full range: -32768 + -1 + 0 + 1 + 32767 = -1, MIN = -32768, MAX = 32767. Catches a regression in sign-extension across all four signed agg kinds. - I32 boundary (INT32_MAX + 1 == 2147483648 must fold to zero matches; full-range SUM = -1). - Multi-key composite (I16 + I32) with negative I16 component — exercises the narrow-path key compose that packs both keys into a single int64 slot. Distinct-pair count and sum agree with the unfused result. - Top-K with a negative-valued I32 sort key — verifies signed reads in the heap comparator. Tests: 2343 / 2345 passing.

Lift two of the conservative gates the round-1 review fix added: * Output column aliases. The previous round forced the planner to refuse fused top-K for any rename (`{alias: source_col}`). The exec now takes an explicit out_alias_syms[] array and uses it when adding columns to the result table — source col is still what we gather from, alias is what we publish under. * Nullable output columns. The materialise loop now calls ray_vec_set_null per gathered row when the source column carries RAY_ATTR_HAS_NULLS, so nullable select columns survive a fused top-K without losing the null bits. ray_vec_set_null lazily allocates dst's nullmap on the first set, so we only pay the alloc cost when there are actual nulls in the result. The sort-key nullable gate stays in place — fpk_cmp doesn't yet implement null ordering (nulls-last / nulls-first), and adding it needs a planner-level NULLS FIRST/LAST surface. The TODO in fpk_cmp documents what's needed to lift that gate too. CSV empty-SYM policy comment cleaned up — the previous text pointed to a nonexistent ROADMAP.md; the trade-off is now described inline. Parity tests cover both lifts: round-trip null in a nullable output column through top-K, and round-trip an alias name through top-K. Tests: 2343 / 2345 passing.

Six follow-up issues from the second-round review of `fused-correctness-review`. All four blockers fixed, both smells fixed, plus regression coverage for the two correctness blockers. Blocker #1 — fallback silently dropped the filter predicate exec_filtered_group_fallback was building `ray_const_table → ray_filter` then overwriting root with `ray_group(...)`, but ray_group has arity 0 and consumes `g->table` directly — the filter node was orphaned. Worse, naively chaining ray_execute(filter) then ray_execute(group) doesn't preserve the selection either: the outer ray_execute compacts and clears g->selection on return, so the group call sees the unfiltered g->table. Fix: take the materialised filtered table from the first ray_execute, swap it in as g->table for the duration of the group call, restore afterwards. Blocker #2 — count1 path accepted nullable group keys Planner gate already rejected this in query.c, but a direct ray_filtered_group() C-API call bypassed the planner. Mirror mk_compile's HAS_NULLS check at exec_filtered_group_count1. Blocker #3 — top-K materialiser corrupted unsupported output types ray_fused_topk_select accepted any column type but the materialise loop reads via read_by_esz / writes via write_col_i64, which only handle scalar BOOL/U8/I16/I32/I64/DATE/TIME/TIMESTAMP/SYM. RAY_STR / GUID / list-shape would gather the raw int64 of the heap-string pointer and write a single byte, producing garbage. Fix: gate output column types at both the planner gate (query.c) and the executor entry (ray_fused_topk_select). Unsupported types fall back to the unfused FILTER + SORT + TAKE. Blocker #4 — top-K sort-key vector parsing ignored SYM width Casting `ray_data(v)` to `int64_t*` is only valid for W64. Compact W8/W16/W32 SYM vectors would read garbage. Use ray_read_sym for the width-correct load. Smell #5 — STR LIKE used the SYM threshold Earlier centralisation accidentally swapped the names; STR path used LIKE_PAR_MIN_ROWS_SYM and SYM path used LIKE_PAR_MIN_ROWS_STR. Restore the correct mapping. Smell #6 — mixed-temporal test was actually same-typed The parity test claimed to cover DATE-vs-TIMESTAMP but used DATE-vs-DATE. Replaced with same-typed DATE compares (which the fused gate accepts) and a comment explaining why a real cross- temporal compare exercises an unsupported engine path that's outside the scope of this branch. Regression coverage: test_fused_group.c - fused_group/fallback_filter_honored: builds a fused node with a nullable agg input (forces fallback) under a selective WHERE; asserts the SUM equals the filtered value, not the unfiltered. - fused_group/count1_rejects_nullable_key: builds a fused node with a nullable group key; verifies the result matches the null-aware unfused group (3 buckets, not 4 with a phantom). fused_group_parity.rfl - Top-K STR output column lex-order check (forces unfused fallback path, asserts correctness). - Same-typed DATE compares verify the fused predicate gate still fires for the supported shape. Tests: 2345 / 2347 passing (1 skipped, 1 pre-existing failure). Bench cluster ratios within noise of the prior state.

The fused top-K comparator resolves SYM IDs through the global sym_strings snapshot, and the materialiser builds a fresh ray_sym_vec_new without copying src->sym_dict. Both assume global-dict storage. Columns carrying a per-vector sym_dict (CSV-loaded SYM cols, splayed slices, IPC-deserialised SYMs) store LOCAL indices that don't map to the global table — running them through the fused path can mis-order results and produce output cols that resolve against the wrong dictionary. Reject local-dict SYM in both gates so the unfused FILTER + SORT + TAKE handles them; that path propagates sym_dict via the guarded slice-aware pattern in sort.c:3642-3660 / rerank.c:174-188. - src/ops/fused_topk.c: gate sort-key SYM and output SYM cols on dict_owner->sym_dict (slice_parent-aware). - src/ops/query.c: planner-side mirror for output cols (early fallback before DAG build). Also cleaned up a parity-test gap surfaced in the same review: the mixed-temporal section's comment claimed DATE-vs-TIMESTAMP rejection coverage but the assertions were same-typed DATE. Added an actual DATE-vs-TIMESTAMP probe asserting count 0 (the unfused-path result; the fused raw-int compare would have matched all 3 rows). New regression test exercises the new gate via a CSV-loaded SYM column inserted in non-alphabetic order — top-K asc must return alphabetic order, which only the unfused path delivers.

…lish) Round-5 review's only remaining note: the mixed-temporal regression test asserted count == 0 with a comment claiming raw fused would match all 3 rows, but raw int32-days >= int64-ns is uniformly false for any plausible TS literal — so 0 also matches the raw answer and the test couldn't catch a regression that removes the cross-temporal gate. I tried flipping the column class (TIMESTAMP col vs DATE const) so raw fused would match all rows while a normalised compare would match a subset — but rayforce's unfused engine also compares temporals by raw stored ints today, so both paths produce the same count. No (op, lhs-class, rhs-class, value) tuple in current engine semantics distinguishes raw-fused from generic for cross- temporal compares. Updated the test to: - Use the TIMESTAMP-col vs DATE-const direction so the comment is at least concrete about which raw answer is being pinned. - Document the limitation honestly: this test verifies the gate routes cross-temporal to the unfused path without crashing or producing corrupted output, NOT that the result differs from raw fused (because the engine has no proper temporal normalisation today — that's tracked separately). Tests still 2345 / 2347 pass.

The round-1 review fix gated fused top-K off whenever any sort key column carried RAY_ATTR_HAS_NULLS, because fpk_cmp read the raw payload without a null-policy leg. This commit adds that leg and lifts the planner gate, matching sort.c's default policy: ASC → NULLS LAST (null is "worse", evicted first in the max-heap of K-best candidates) DESC → NULLS FIRST (null is "better", retained first) Implementation: - fpk_keyspec_t carries a per-key has_nulls flag (set at compile time from col->attrs) and a back-pointer to the column for ray_vec_is_null. Runtime check is gated by has_nulls so the common no-null path is byte-equivalent to before. - fpk_cmp short-circuits the null leg before reading the raw payload — both-null falls through to the next key (or the source-row tie break), single-null returns immediately with the policy-driven sign. - query.c sort-key gate now only verifies the column exists. Nullable sort keys are accepted for fused top-K. Coverage in fused_group_parity.rfl: ASC top-5 with two nulls verifies they sort last with stable source-order tie break; ASC top-3 verifies nulls are excluded from a smaller K when LAST policy is in effect; DESC top-3 verifies the inverse FIRST policy. Future: caller-specified NULLS FIRST / NULLS LAST. The default policy is hard-wired here; an explicit clause in the query DSL would thread through fpk_keyspec_t as a third orientation flag. Tests: 2345 / 2347 passing.

* macOS clang -Werror,-Wc23-extensions src/ops/fused_group.c:2365 had `materialize:` immediately followed by an array declaration. C23 allows that but pre-C23 requires a statement after the label, and clang on macOS rejects the extension under -Werror. Insert an empty statement so all four matrix variants (ubuntu/macos × debug/release) build clean. * sym/like_fn/sym_vec_null_sym test failure The fast-path LUT projection in ray_like_fn returned 0 for any out-of-range sym ID. The OOM fallback path right below it (and the atom branch above) treats unknown / null-resolving sids as empty strings — the test's commented expectation "NULL→\"\" also matches *" is correct semantics, the fast path was just inconsistent. Compute the empty-string match once (use_simple or full glob), cache as empty_match, and let the row pass return that for out-of-range sids. Now 2346 / 2347 pass with 0 failures (1 skipped). Closes the matrix red on the branch so CI is green when reopened.

singaraiona added 10 commits May 7, 2026 13:23

singaraiona closed this May 7, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fused fast-paths: address external review correctness findings#198

Fused fast-paths: address external review correctness findings#198
singaraiona wants to merge 10 commits intomasterfrom
fused-correctness-review

singaraiona commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

singaraiona commented May 7, 2026

Summary

Per-finding mapping

Still flagged (deliberate, with TODOs)

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant