Fused fast-paths: address external review correctness findings#198
Closed
singaraiona wants to merge 10 commits intomasterfrom
Closed
Fused fast-paths: address external review correctness findings#198singaraiona wants to merge 10 commits intomasterfrom
singaraiona wants to merge 10 commits intomasterfrom
Conversation
The fused predicate / agg / top-K paths landed with several
fast-path-only assumptions that broke for non-benchmark-shaped
inputs. This commit closes the gaps an external review flagged:
* fp_eval_cmp constant truncation
fp_compile_cmp now range-checks the decoded `cval` against the
column's representable range and pre-folds the comparison to all-
true / all-false when the constant lies outside it. Previously
`u8_col == 300` was silently rewritten to `u8_col == 44`
(300 cast to uint8_t) and matched whatever rows held value 44.
fp_eval_cmp now short-circuits via the new fp_fold_t enum before
the per-row width-cast compare runs.
* signed narrow aggregates read as unsigned
Added read_signed_by_esz and a per-agg in_unsigned flag. SUM /
MIN / MAX / AVG over an I16 / I32 / DATE / TIME / TIMESTAMP
column now sign-extend correctly, so a stored -1 reads as -1 and
not 65535. read_by_esz remains in place for SYM dictionary IDs
and composite-key packing where the bit pattern is what matters.
* fused top-K alias drop
The fused materialiser names output columns after the source
symbol — schemas with renames like `{alias: source_col}` would
silently emit a column named `source_col`. Tighten the planner
gate to require alias == source until the fused exec accepts a
separate aliases array.
* fused top-K nullable columns
The fused materialiser does not propagate source nullmaps and the
heap comparator does not implement null ordering. Gate the fused
path off when any sort key or output column carries
RAY_ATTR_HAS_NULLS.
* WHERE-AND reorder past fault-able conjuncts
The cost-based reorder previously reshuffled any conjuncts that
compiled to a vector. An expression like
`(and (!= y 0) (< (/ x y) 10))` could be reordered, exposing the
divide-by-zero. Track a reorder_safe flag — when any conjunct
contains an op outside the comparison/logic vocabulary
(OP_DIV/OP_MOD/user-call/etc.) we still chain the conjuncts as
separate filters but evaluate them in user-given order.
* SYM-LIKE seen[] data race
The seen-mark workers wrote a plain non-atomic 1 to a shared
byte; even with idempotent value, that's UB in C/C++. Switch to
__atomic_store_n with __ATOMIC_RELAXED — same x86 codegen as a
plain byte store but standard-defined.
* CSV nullmap strip after empty-SYM remap
csv_remap_empty_sym_to_id clears null bits for empty SYM rows but
did not flip col_had_null, so RAY_ATTR_HAS_NULLS stuck on
columns whose nullmap was actually all-zero. Step 9c now also
walks the post-remap nullmap and strips the attribute when no
bits remain set. This unblocks the new fused-top-K nullable gate
for ClickBench-shaped SYM columns that were never really nullable.
Regression coverage in test_fused_group:
- eq_const_out_of_range_u8: u8 == 300 must match 0 rows
- ne_const_out_of_range_u8: u8 != 300 must match every row
- sum_negative_i16: SUM(-1, -2, 3, 4) must equal 4 (not 131076)
Round 1 left the deeper structural concerns from the review intact:
nullable predicate / key / agg-input columns were still accepted, the
planner-side support check diverged from the executor compile, and
mixed temporal types could pass through. This commit closes them.
Nullable column gates (was: only top-K sort/output checked):
- fp_compile_cmp now rejects predicate columns with
RAY_ATTR_HAS_NULLS via a new fp_col_supported() helper — the
fused per-row evaluator reads raw payload, so a sentinel slot
would diverge from the unfused null-aware kernel.
- mk_compile rejects HAS_NULLS on every group key column (the
composite key compose can't distinguish a null sentinel from a
legitimate value with the same bit pattern) and on every
aggregate input column (raw reads would corrupt SUM/MIN/MAX/AVG).
- The query.c planner gate mirrors both checks so the planner
never enables fused for a shape the executor would reject.
Planner / executor symmetry (was: planner accepted any atom RHS):
- New helper fp_atom_col_compatible(atom_type, col_type) shared by
fp_check_simple_cmp (planner) and fp_compile_cmp (executor).
Gates that previously diverged now check identical conditions:
the planner stops handing nyi shapes to the executor at runtime.
Mixed temporal rejection (was: DATE/TIME/TIMESTAMP all interchangeable):
- fp_atom_col_compatible requires DATE/TIME/TIMESTAMP cols to have
a same-typed atom on the RHS. The fused compare reads raw stored
units (days vs microseconds vs nanoseconds), so a cross-temporal
constant would have produced meaningless inequalities. Bail to
OP_GROUP, which normalises temporals before the compare.
Coverage (test/rfl/integration/fused_group_parity.rfl):
- U8 column with the value 44 + predicate `u8 == 300` ⇒ 0 rows
(would match 44 if the truncation fold were missing).
- I16 column with negative values: SUM = 5 (-1+-2+3+4+-5+6),
MIN = -5, MAX = 6 (would diverge if the signed read were off).
- Tautology / contradiction folds on out-of-range U8 constants.
- Chained AND with cheap+expensive conjuncts agrees regardless of
reorder choice.
Tests: 2342 / 2344 passing (1 skipped, 1 pre-existing failure).
Bench cluster ratios at parity with the prior wide-key state.
Round 2 closed structural gates; round 3 addresses remaining concerns the review explicitly flagged. Top-K tie ordering (#9): fpk_cmp now breaks all-keys-tied ties by source row index. Previously the comparator returned 0 on tie, so worker-pool scheduling non-determinism could cause the K-best heap to retain different rows across runs of the same query against the same data. The new tie-break is direction-independent: for both ASC and DESC top-K we want the K rows with the smallest source indices to survive — that matches a stable sort's prefix. Runtime fallback symmetry (#8): exec_filtered_group now wraps the count1 / multi dispatch with a graceful unfused fallback. When fp_compile_pred or mk_compile rejects the shape (returning a 'nyi' error), we build a fresh FILTER + GROUP subgraph from the same predicate / keys / agg-input ops on the OP_FILTERED_GROUP node and execute it via ray_execute. Defence in depth — the round-2 planner gate tightening should prevent exec from seeing unsupported shapes, but if a future change introduces a divergence we degrade to a slower-but-correct result instead of surfacing a user-visible error. CSV empty-SYM policy documentation + opt-out (#5): The empty-CSV-field → empty-symbol remap is now an explicit, documented policy. Default behaviour is unchanged (matches DuckDB / Spark / polars). Setting RAYFORCE_CSV_EMPTY_SYM_NULL=1 reverts to the pre-2026-05 semantics where empty fields stay nullable. The block comment around csv.c:599 spells out the trade-off and notes that the SYM type can't carry a separate empty/null distinction on the value side. Extended parity tests (#10): test/rfl/integration/fused_group_parity.rfl now also covers: - Nullable predicate columns (gate routes to unfused, result matches null-aware semantics). - Nullable agg-input column (sum is null-aware). - Nullable group key column (nulls bucket separately). - DATE column compared to a same-typed atom (works); the gate prevents mixed temporal compares from firing fused. - Top-K aliased projection (alias preserved in output schema). - Top-K tie ordering (deterministic source-order prefix). Tests: 2343 / 2345 passing (1 skipped, 1 pre-existing failure). Bench cluster ratios at parity with the prior round-2 state — no performance regression from the additional gates and fallback.
The review flagged a number of bare numeric constants threaded
through the fused fast paths. Move them behind named macros with
inline rationale so future tuning has a single source of truth and
new readers can see the why instead of just the value.
FPK_MAX_K (8192)
Top-K bound — couples to a stack-resident global_idx[FPK_MAX_K]
in fused_topk_select. Larger K falls through to unfused
FILTER + SORT + TAKE which has no buffer constraint. Used by
both query.c gates (planner) and fused_topk.c (executor).
FP_COMBINE_PAR_MIN (50000)
Crossover row count below which the parallel 3-pass radix
combine in fused_group loses to the serial walk because of
fixed dispatch + scratch alloc cost.
LIKE_PAR_MIN_ROWS_STR (200000)
LIKE_PAR_MIN_ROWS_SYM (100000)
Parallelism crossover for OP_LIKE. STR scans match per-row;
SYM scans match per-dict-entry, so SYM parallelises well at
lower row counts.
FP_SHARD_INIT_CAP / FP_SHARD_MAX_CAP keep their values but pick up
a comment block explaining what the bound protects (per-shard memory
budget) and what happens at the boundary (OOM → unfused fallback).
No behaviour change — every replaced expression resolves to the same
integer at the same point in the code.
Tests: 2343 / 2345 (1 skipped, 1 pre-existing failure).
Bench cluster ratios at parity with the prior round-3 state.
Extend the parity test with width-boundary and multi-key shapes that the review's recommendation #2 ("adversarial random-data tests") explicitly called out: - I16 boundary constants on both ends (INT16_MIN-1, INT16_MAX+1) plus the actual MIN/MAX values living in the column. All folds go to all-true / all-false consistent with arithmetic semantics. - I16 SUM/MIN/MAX with the full range: -32768 + -1 + 0 + 1 + 32767 = -1, MIN = -32768, MAX = 32767. Catches a regression in sign-extension across all four signed agg kinds. - I32 boundary (INT32_MAX + 1 == 2147483648 must fold to zero matches; full-range SUM = -1). - Multi-key composite (I16 + I32) with negative I16 component — exercises the narrow-path key compose that packs both keys into a single int64 slot. Distinct-pair count and sum agree with the unfused result. - Top-K with a negative-valued I32 sort key — verifies signed reads in the heap comparator. Tests: 2343 / 2345 passing.
Lift two of the conservative gates the round-1 review fix added:
* Output column aliases. The previous round forced the planner
to refuse fused top-K for any rename (`{alias: source_col}`).
The exec now takes an explicit out_alias_syms[] array and uses
it when adding columns to the result table — source col is
still what we gather from, alias is what we publish under.
* Nullable output columns. The materialise loop now calls
ray_vec_set_null per gathered row when the source column
carries RAY_ATTR_HAS_NULLS, so nullable select columns survive
a fused top-K without losing the null bits. ray_vec_set_null
lazily allocates dst's nullmap on the first set, so we only
pay the alloc cost when there are actual nulls in the result.
The sort-key nullable gate stays in place — fpk_cmp doesn't yet
implement null ordering (nulls-last / nulls-first), and adding it
needs a planner-level NULLS FIRST/LAST surface. The TODO in
fpk_cmp documents what's needed to lift that gate too.
CSV empty-SYM policy comment cleaned up — the previous text pointed
to a nonexistent ROADMAP.md; the trade-off is now described inline.
Parity tests cover both lifts: round-trip null in a nullable output
column through top-K, and round-trip an alias name through top-K.
Tests: 2343 / 2345 passing.
Six follow-up issues from the second-round review of `fused-correctness-review`. All four blockers fixed, both smells fixed, plus regression coverage for the two correctness blockers. Blocker #1 — fallback silently dropped the filter predicate exec_filtered_group_fallback was building `ray_const_table → ray_filter` then overwriting root with `ray_group(...)`, but ray_group has arity 0 and consumes `g->table` directly — the filter node was orphaned. Worse, naively chaining ray_execute(filter) then ray_execute(group) doesn't preserve the selection either: the outer ray_execute compacts and clears g->selection on return, so the group call sees the unfiltered g->table. Fix: take the materialised filtered table from the first ray_execute, swap it in as g->table for the duration of the group call, restore afterwards. Blocker #2 — count1 path accepted nullable group keys Planner gate already rejected this in query.c, but a direct ray_filtered_group() C-API call bypassed the planner. Mirror mk_compile's HAS_NULLS check at exec_filtered_group_count1. Blocker #3 — top-K materialiser corrupted unsupported output types ray_fused_topk_select accepted any column type but the materialise loop reads via read_by_esz / writes via write_col_i64, which only handle scalar BOOL/U8/I16/I32/I64/DATE/TIME/TIMESTAMP/SYM. RAY_STR / GUID / list-shape would gather the raw int64 of the heap-string pointer and write a single byte, producing garbage. Fix: gate output column types at both the planner gate (query.c) and the executor entry (ray_fused_topk_select). Unsupported types fall back to the unfused FILTER + SORT + TAKE. Blocker #4 — top-K sort-key vector parsing ignored SYM width Casting `ray_data(v)` to `int64_t*` is only valid for W64. Compact W8/W16/W32 SYM vectors would read garbage. Use ray_read_sym for the width-correct load. Smell #5 — STR LIKE used the SYM threshold Earlier centralisation accidentally swapped the names; STR path used LIKE_PAR_MIN_ROWS_SYM and SYM path used LIKE_PAR_MIN_ROWS_STR. Restore the correct mapping. Smell #6 — mixed-temporal test was actually same-typed The parity test claimed to cover DATE-vs-TIMESTAMP but used DATE-vs-DATE. Replaced with same-typed DATE compares (which the fused gate accepts) and a comment explaining why a real cross- temporal compare exercises an unsupported engine path that's outside the scope of this branch. Regression coverage: test_fused_group.c - fused_group/fallback_filter_honored: builds a fused node with a nullable agg input (forces fallback) under a selective WHERE; asserts the SUM equals the filtered value, not the unfiltered. - fused_group/count1_rejects_nullable_key: builds a fused node with a nullable group key; verifies the result matches the null-aware unfused group (3 buckets, not 4 with a phantom). fused_group_parity.rfl - Top-K STR output column lex-order check (forces unfused fallback path, asserts correctness). - Same-typed DATE compares verify the fused predicate gate still fires for the supported shape. Tests: 2345 / 2347 passing (1 skipped, 1 pre-existing failure). Bench cluster ratios within noise of the prior state.
The fused top-K comparator resolves SYM IDs through the global sym_strings snapshot, and the materialiser builds a fresh ray_sym_vec_new without copying src->sym_dict. Both assume global-dict storage. Columns carrying a per-vector sym_dict (CSV-loaded SYM cols, splayed slices, IPC-deserialised SYMs) store LOCAL indices that don't map to the global table — running them through the fused path can mis-order results and produce output cols that resolve against the wrong dictionary. Reject local-dict SYM in both gates so the unfused FILTER + SORT + TAKE handles them; that path propagates sym_dict via the guarded slice-aware pattern in sort.c:3642-3660 / rerank.c:174-188. - src/ops/fused_topk.c: gate sort-key SYM and output SYM cols on dict_owner->sym_dict (slice_parent-aware). - src/ops/query.c: planner-side mirror for output cols (early fallback before DAG build). Also cleaned up a parity-test gap surfaced in the same review: the mixed-temporal section's comment claimed DATE-vs-TIMESTAMP rejection coverage but the assertions were same-typed DATE. Added an actual DATE-vs-TIMESTAMP probe asserting count 0 (the unfused-path result; the fused raw-int compare would have matched all 3 rows). New regression test exercises the new gate via a CSV-loaded SYM column inserted in non-alphabetic order — top-K asc must return alphabetic order, which only the unfused path delivers.
…lish)
Round-5 review's only remaining note: the mixed-temporal regression
test asserted count == 0 with a comment claiming raw fused would
match all 3 rows, but raw int32-days >= int64-ns is uniformly false
for any plausible TS literal — so 0 also matches the raw answer
and the test couldn't catch a regression that removes the
cross-temporal gate.
I tried flipping the column class (TIMESTAMP col vs DATE const) so
raw fused would match all rows while a normalised compare would
match a subset — but rayforce's unfused engine also compares
temporals by raw stored ints today, so both paths produce the same
count. No (op, lhs-class, rhs-class, value) tuple in current
engine semantics distinguishes raw-fused from generic for cross-
temporal compares.
Updated the test to:
- Use the TIMESTAMP-col vs DATE-const direction so the comment
is at least concrete about which raw answer is being pinned.
- Document the limitation honestly: this test verifies the gate
routes cross-temporal to the unfused path without crashing or
producing corrupted output, NOT that the result differs from
raw fused (because the engine has no proper temporal
normalisation today — that's tracked separately).
Tests still 2345 / 2347 pass.
The round-1 review fix gated fused top-K off whenever any sort key
column carried RAY_ATTR_HAS_NULLS, because fpk_cmp read the raw
payload without a null-policy leg. This commit adds that leg and
lifts the planner gate, matching sort.c's default policy:
ASC → NULLS LAST (null is "worse", evicted first in the
max-heap of K-best candidates)
DESC → NULLS FIRST (null is "better", retained first)
Implementation:
- fpk_keyspec_t carries a per-key has_nulls flag (set at compile
time from col->attrs) and a back-pointer to the column for
ray_vec_is_null. Runtime check is gated by has_nulls so the
common no-null path is byte-equivalent to before.
- fpk_cmp short-circuits the null leg before reading the raw
payload — both-null falls through to the next key (or the
source-row tie break), single-null returns immediately with
the policy-driven sign.
- query.c sort-key gate now only verifies the column exists.
Nullable sort keys are accepted for fused top-K.
Coverage in fused_group_parity.rfl: ASC top-5 with two nulls
verifies they sort last with stable source-order tie break; ASC
top-3 verifies nulls are excluded from a smaller K when LAST
policy is in effect; DESC top-3 verifies the inverse FIRST policy.
Future: caller-specified NULLS FIRST / NULLS LAST. The default
policy is hard-wired here; an explicit clause in the query DSL
would thread through fpk_keyspec_t as a third orientation flag.
Tests: 2345 / 2347 passing.
singaraiona
added a commit
that referenced
this pull request
May 7, 2026
* macOS clang -Werror,-Wc23-extensions
src/ops/fused_group.c:2365 had `materialize:` immediately
followed by an array declaration. C23 allows that but pre-C23
requires a statement after the label, and clang on macOS rejects
the extension under -Werror. Insert an empty statement so all
four matrix variants (ubuntu/macos × debug/release) build clean.
* sym/like_fn/sym_vec_null_sym test failure
The fast-path LUT projection in ray_like_fn returned 0 for any
out-of-range sym ID. The OOM fallback path right below it (and
the atom branch above) treats unknown / null-resolving sids as
empty strings — the test's commented expectation "NULL→\"\"
also matches *" is correct semantics, the fast path was just
inconsistent. Compute the empty-string match once (use_simple
or full glob), cache as empty_match, and let the row pass return
that for out-of-range sids. Now 2346 / 2347 pass with 0
failures (1 skipped).
Closes the matrix red on the branch so CI is green when reopened.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Branched from
masterto address the findings in an external code review of the fused-paths work (fp_eval_cmp/fused_group/fused_topk) — surfacing predicate-constant truncation, signed-narrow-agg reads, top-K alias drop, top-K nullmap drop, WHERE-AND reorder past fault-able conjuncts, a SYM-LIKEseen[]data race, mismatch between planner and executor support, and missing nullable / temporal gating. All findings are addressed; the original ClickBench-driven performance work stays onmasterunchanged.Per-finding mapping
93206a39fp_compile_cmprange-checkscvalagainst the column's representable range and pre-folds out-of-range to all-true / all-false via newfp_fold_t.93206a39read_signed_by_esz+ per-aggin_unsignedflag. SUM/MIN/MAX/AVG over I16/I32/DATE/TIME/TIMESTAMP sign-extend correctly.93206a39,152f3917out_alias_syms[]through toray_fused_topk_selectso the materialiser publishes under the alias.93206a39,152f3917ray_vec_set_nullper gathered row.93206a39reorder_safeflag — fault-able / side-effecting ops keep user-given order; cheap-first sort only when all conjuncts are pure comparison/logic.seen[]race (Medium)93206a39__atomic_store_n(..., __ATOMIC_RELAXED)— same x86 codegen, standard-defined.93206a39RAY_ATTR_HAS_NULLSwhen post-remap nullmap is empty.e2b863a3fp_col_supported()+ same gate inmk_compile; query.c planner mirrors.e2b863a3fp_atom_col_compatible()helper used by bothfp_check_simple_cmp(planner) andfp_compile_cmp(executor).e2b863a3950b0fedfpk_cmpbreaks all-keys-tied ties by source row index ascending — deterministic stable-sort prefix semantics.950b0fedexec_filtered_groupwraps the fused dispatch with a graceful unfused FILTER + GROUP fallback when the executor returnsnyi.950b0fedRAYFORCE_CSV_EMPTY_SYM_NULL=1to retain pre-2026-05 nullable-empty semantics.b5af5adaFPK_MAX_K,FP_COMBINE_PAR_MIN,LIKE_PAR_MIN_ROWS_STR,LIKE_PAR_MIN_ROWS_SYMnamed with rationale.04b3a2ae,152f3917test/rfl/integration/fused_group_parity.rfl+ 3 regression unit tests (eq_const_out_of_range_u8,ne_const_out_of_range_u8,sum_negative_i16).Still flagged (deliberate, with TODOs)
fpk_cmpstill rejects nullable sort keys. Adding null-policy (NULLS FIRST / NULLS LAST) needs a planner-level surface; the TODO infpk_cmpdocuments what's needed.RAYFORCE_DISABLE_FUSED_*would be stronger coverage.Test plan
make releasebuilds clean with-Werror.make test— 2343/2345 pass, 1 skipped, 1 pre-existing failure (sym/like_fn/sym_vec_null_sym, also fails onmaster).test/rfl/integration/fused_group_parity.rfl— every fused / unfused divergence the review surfaced has a deterministic regression case here.makeClickBench bench harness — cluster ratios at parity withmaster(no perf regression from the gates / fallback / threshold renaming).