Skip to content

Fused fast-paths: address external review correctness findings#198

Closed
singaraiona wants to merge 10 commits intomasterfrom
fused-correctness-review
Closed

Fused fast-paths: address external review correctness findings#198
singaraiona wants to merge 10 commits intomasterfrom
fused-correctness-review

Conversation

@singaraiona
Copy link
Copy Markdown
Collaborator

Summary

Branched from master to address the findings in an external code review of the fused-paths work (fp_eval_cmp / fused_group / fused_topk) — surfacing predicate-constant truncation, signed-narrow-agg reads, top-K alias drop, top-K nullmap drop, WHERE-AND reorder past fault-able conjuncts, a SYM-LIKE seen[] data race, mismatch between planner and executor support, and missing nullable / temporal gating. All findings are addressed; the original ClickBench-driven performance work stays on master unchanged.

Per-finding mapping

Review finding Commit Fix
#3 Constant truncation (Critical) 93206a39 fp_compile_cmp range-checks cval against the column's representable range and pre-folds out-of-range to all-true / all-false via new fp_fold_t.
#4 Signed narrow agg reads (High) 93206a39 New read_signed_by_esz + per-agg in_unsigned flag. SUM/MIN/MAX/AVG over I16/I32/DATE/TIME/TIMESTAMP sign-extend correctly.
#1 Top-K alias drop (High) 93206a39, 152f3917 Round-1 gated off renames; round-3 plumbs out_alias_syms[] through to ray_fused_topk_select so the materialiser publishes under the alias.
#2 Top-K null materialise (High) 93206a39, 152f3917 Round-1 gated off nullable cols; round-3 propagates nullmaps via ray_vec_set_null per gathered row.
#6 WHERE-AND reorder (Medium) 93206a39 reorder_safe flag — fault-able / side-effecting ops keep user-given order; cheap-first sort only when all conjuncts are pure comparison/logic.
#7 SYM-LIKE seen[] race (Medium) 93206a39 __atomic_store_n(..., __ATOMIC_RELAXED) — same x86 codegen, standard-defined.
Adjacent 93206a39 CSV step 9c strips RAY_ATTR_HAS_NULLS when post-remap nullmap is empty.
Nullable cols in fused predicate / key / agg-input (round-2 followup) e2b863a3 New fp_col_supported() + same gate in mk_compile; query.c planner mirrors.
Planner ↔ executor support divergence (round-2 followup) e2b863a3 Shared fp_atom_col_compatible() helper used by both fp_check_simple_cmp (planner) and fp_compile_cmp (executor).
Mixed temporal types (round-2 followup) e2b863a3 DATE/TIME/TIMESTAMP cols require same-typed atom RHS; cross-temporal compares fall back to OP_GROUP.
#9 Top-K tie ordering (Medium) 950b0fed fpk_cmp breaks all-keys-tied ties by source row index ascending — deterministic stable-sort prefix semantics.
#8 Runtime fallback (Medium) 950b0fed exec_filtered_group wraps the fused dispatch with a graceful unfused FILTER + GROUP fallback when the executor returns nyi.
#5 CSV empty-SYM policy (Medium / High) 950b0fed Documented inline; opt-out via RAYFORCE_CSV_EMPTY_SYM_NULL=1 to retain pre-2026-05 nullable-empty semantics.
#4 (recommendation) Centralise thresholds b5af5ada FPK_MAX_K, FP_COMBINE_PAR_MIN, LIKE_PAR_MIN_ROWS_STR, LIKE_PAR_MIN_ROWS_SYM named with rationale.
#2 (recommendation) Adversarial tests 04b3a2ae, 152f3917 I16/I32 boundary constants, full-range SUM/MIN/MAX, multi-key composite with negative I16, top-K with negative sort key, top-K nullmap propagation, top-K alias rename.
#10 Test coverage every commit test/rfl/integration/fused_group_parity.rfl + 3 regression unit tests (eq_const_out_of_range_u8, ne_const_out_of_range_u8, sum_negative_i16).

Still flagged (deliberate, with TODOs)

  • Top-K sort-key nullablefpk_cmp still rejects nullable sort keys. Adding null-policy (NULLS FIRST / NULLS LAST) needs a planner-level surface; the TODO in fpk_cmp documents what's needed.
  • Multi-process fused/unfused parity harness — the deterministic in-process parity test catches the surfaced regressions, but a forked-process harness running each query under RAYFORCE_DISABLE_FUSED_* would be stronger coverage.

Test plan

  • make release builds clean with -Werror.
  • make test — 2343/2345 pass, 1 skipped, 1 pre-existing failure (sym/like_fn/sym_vec_null_sym, also fails on master).
  • test/rfl/integration/fused_group_parity.rfl — every fused / unfused divergence the review surfaced has a deterministic regression case here.
  • make ClickBench bench harness — cluster ratios at parity with master (no perf regression from the gates / fallback / threshold renaming).

singaraiona added 10 commits May 7, 2026 13:23
The fused predicate / agg / top-K paths landed with several
fast-path-only assumptions that broke for non-benchmark-shaped
inputs.  This commit closes the gaps an external review flagged:

* fp_eval_cmp constant truncation
  fp_compile_cmp now range-checks the decoded `cval` against the
  column's representable range and pre-folds the comparison to all-
  true / all-false when the constant lies outside it.  Previously
  `u8_col == 300` was silently rewritten to `u8_col == 44`
  (300 cast to uint8_t) and matched whatever rows held value 44.
  fp_eval_cmp now short-circuits via the new fp_fold_t enum before
  the per-row width-cast compare runs.

* signed narrow aggregates read as unsigned
  Added read_signed_by_esz and a per-agg in_unsigned flag.  SUM /
  MIN / MAX / AVG over an I16 / I32 / DATE / TIME / TIMESTAMP
  column now sign-extend correctly, so a stored -1 reads as -1 and
  not 65535.  read_by_esz remains in place for SYM dictionary IDs
  and composite-key packing where the bit pattern is what matters.

* fused top-K alias drop
  The fused materialiser names output columns after the source
  symbol — schemas with renames like `{alias: source_col}` would
  silently emit a column named `source_col`.  Tighten the planner
  gate to require alias == source until the fused exec accepts a
  separate aliases array.

* fused top-K nullable columns
  The fused materialiser does not propagate source nullmaps and the
  heap comparator does not implement null ordering.  Gate the fused
  path off when any sort key or output column carries
  RAY_ATTR_HAS_NULLS.

* WHERE-AND reorder past fault-able conjuncts
  The cost-based reorder previously reshuffled any conjuncts that
  compiled to a vector.  An expression like
  `(and (!= y 0) (< (/ x y) 10))` could be reordered, exposing the
  divide-by-zero.  Track a reorder_safe flag — when any conjunct
  contains an op outside the comparison/logic vocabulary
  (OP_DIV/OP_MOD/user-call/etc.) we still chain the conjuncts as
  separate filters but evaluate them in user-given order.

* SYM-LIKE seen[] data race
  The seen-mark workers wrote a plain non-atomic 1 to a shared
  byte; even with idempotent value, that's UB in C/C++.  Switch to
  __atomic_store_n with __ATOMIC_RELAXED — same x86 codegen as a
  plain byte store but standard-defined.

* CSV nullmap strip after empty-SYM remap
  csv_remap_empty_sym_to_id clears null bits for empty SYM rows but
  did not flip col_had_null, so RAY_ATTR_HAS_NULLS stuck on
  columns whose nullmap was actually all-zero.  Step 9c now also
  walks the post-remap nullmap and strips the attribute when no
  bits remain set.  This unblocks the new fused-top-K nullable gate
  for ClickBench-shaped SYM columns that were never really nullable.

Regression coverage in test_fused_group:
  - eq_const_out_of_range_u8: u8 == 300 must match 0 rows
  - ne_const_out_of_range_u8: u8 != 300 must match every row
  - sum_negative_i16: SUM(-1, -2, 3, 4) must equal 4 (not 131076)
Round 1 left the deeper structural concerns from the review intact:
nullable predicate / key / agg-input columns were still accepted, the
planner-side support check diverged from the executor compile, and
mixed temporal types could pass through.  This commit closes them.

Nullable column gates (was: only top-K sort/output checked):

  - fp_compile_cmp now rejects predicate columns with
    RAY_ATTR_HAS_NULLS via a new fp_col_supported() helper — the
    fused per-row evaluator reads raw payload, so a sentinel slot
    would diverge from the unfused null-aware kernel.
  - mk_compile rejects HAS_NULLS on every group key column (the
    composite key compose can't distinguish a null sentinel from a
    legitimate value with the same bit pattern) and on every
    aggregate input column (raw reads would corrupt SUM/MIN/MAX/AVG).
  - The query.c planner gate mirrors both checks so the planner
    never enables fused for a shape the executor would reject.

Planner / executor symmetry (was: planner accepted any atom RHS):

  - New helper fp_atom_col_compatible(atom_type, col_type) shared by
    fp_check_simple_cmp (planner) and fp_compile_cmp (executor).
    Gates that previously diverged now check identical conditions:
    the planner stops handing nyi shapes to the executor at runtime.

Mixed temporal rejection (was: DATE/TIME/TIMESTAMP all interchangeable):

  - fp_atom_col_compatible requires DATE/TIME/TIMESTAMP cols to have
    a same-typed atom on the RHS.  The fused compare reads raw stored
    units (days vs microseconds vs nanoseconds), so a cross-temporal
    constant would have produced meaningless inequalities.  Bail to
    OP_GROUP, which normalises temporals before the compare.

Coverage (test/rfl/integration/fused_group_parity.rfl):

  - U8 column with the value 44 + predicate `u8 == 300` ⇒ 0 rows
    (would match 44 if the truncation fold were missing).
  - I16 column with negative values: SUM = 5 (-1+-2+3+4+-5+6),
    MIN = -5, MAX = 6 (would diverge if the signed read were off).
  - Tautology / contradiction folds on out-of-range U8 constants.
  - Chained AND with cheap+expensive conjuncts agrees regardless of
    reorder choice.

Tests: 2342 / 2344 passing (1 skipped, 1 pre-existing failure).
Bench cluster ratios at parity with the prior wide-key state.
Round 2 closed structural gates; round 3 addresses remaining concerns
the review explicitly flagged.

Top-K tie ordering (#9):

  fpk_cmp now breaks all-keys-tied ties by source row index.
  Previously the comparator returned 0 on tie, so worker-pool
  scheduling non-determinism could cause the K-best heap to retain
  different rows across runs of the same query against the same
  data.  The new tie-break is direction-independent: for both ASC
  and DESC top-K we want the K rows with the smallest source
  indices to survive — that matches a stable sort's prefix.

Runtime fallback symmetry (#8):

  exec_filtered_group now wraps the count1 / multi dispatch with a
  graceful unfused fallback.  When fp_compile_pred or mk_compile
  rejects the shape (returning a 'nyi' error), we build a fresh
  FILTER + GROUP subgraph from the same predicate / keys / agg-input
  ops on the OP_FILTERED_GROUP node and execute it via ray_execute.
  Defence in depth — the round-2 planner gate tightening should
  prevent exec from seeing unsupported shapes, but if a future
  change introduces a divergence we degrade to a slower-but-correct
  result instead of surfacing a user-visible error.

CSV empty-SYM policy documentation + opt-out (#5):

  The empty-CSV-field → empty-symbol remap is now an explicit,
  documented policy.  Default behaviour is unchanged (matches DuckDB
  / Spark / polars).  Setting RAYFORCE_CSV_EMPTY_SYM_NULL=1 reverts
  to the pre-2026-05 semantics where empty fields stay nullable.
  The block comment around csv.c:599 spells out the trade-off and
  notes that the SYM type can't carry a separate empty/null
  distinction on the value side.

Extended parity tests (#10):

  test/rfl/integration/fused_group_parity.rfl now also covers:

    - Nullable predicate columns (gate routes to unfused, result
      matches null-aware semantics).
    - Nullable agg-input column (sum is null-aware).
    - Nullable group key column (nulls bucket separately).
    - DATE column compared to a same-typed atom (works); the
      gate prevents mixed temporal compares from firing fused.
    - Top-K aliased projection (alias preserved in output schema).
    - Top-K tie ordering (deterministic source-order prefix).

Tests: 2343 / 2345 passing (1 skipped, 1 pre-existing failure).
Bench cluster ratios at parity with the prior round-2 state — no
performance regression from the additional gates and fallback.
The review flagged a number of bare numeric constants threaded
through the fused fast paths.  Move them behind named macros with
inline rationale so future tuning has a single source of truth and
new readers can see the why instead of just the value.

  FPK_MAX_K (8192)
    Top-K bound — couples to a stack-resident global_idx[FPK_MAX_K]
    in fused_topk_select.  Larger K falls through to unfused
    FILTER + SORT + TAKE which has no buffer constraint.  Used by
    both query.c gates (planner) and fused_topk.c (executor).

  FP_COMBINE_PAR_MIN (50000)
    Crossover row count below which the parallel 3-pass radix
    combine in fused_group loses to the serial walk because of
    fixed dispatch + scratch alloc cost.

  LIKE_PAR_MIN_ROWS_STR (200000)
  LIKE_PAR_MIN_ROWS_SYM (100000)
    Parallelism crossover for OP_LIKE.  STR scans match per-row;
    SYM scans match per-dict-entry, so SYM parallelises well at
    lower row counts.

FP_SHARD_INIT_CAP / FP_SHARD_MAX_CAP keep their values but pick up
a comment block explaining what the bound protects (per-shard memory
budget) and what happens at the boundary (OOM → unfused fallback).

No behaviour change — every replaced expression resolves to the same
integer at the same point in the code.

Tests: 2343 / 2345 (1 skipped, 1 pre-existing failure).
Bench cluster ratios at parity with the prior round-3 state.
Extend the parity test with width-boundary and multi-key shapes that
the review's recommendation #2 ("adversarial random-data tests")
explicitly called out:

  - I16 boundary constants on both ends (INT16_MIN-1, INT16_MAX+1)
    plus the actual MIN/MAX values living in the column.  All folds
    go to all-true / all-false consistent with arithmetic semantics.
  - I16 SUM/MIN/MAX with the full range: -32768 + -1 + 0 + 1 +
    32767 = -1, MIN = -32768, MAX = 32767.  Catches a regression in
    sign-extension across all four signed agg kinds.
  - I32 boundary (INT32_MAX + 1 == 2147483648 must fold to zero
    matches; full-range SUM = -1).
  - Multi-key composite (I16 + I32) with negative I16 component —
    exercises the narrow-path key compose that packs both keys into
    a single int64 slot.  Distinct-pair count and sum agree with
    the unfused result.
  - Top-K with a negative-valued I32 sort key — verifies signed
    reads in the heap comparator.

Tests: 2343 / 2345 passing.
Lift two of the conservative gates the round-1 review fix added:

  * Output column aliases.  The previous round forced the planner
    to refuse fused top-K for any rename (`{alias: source_col}`).
    The exec now takes an explicit out_alias_syms[] array and uses
    it when adding columns to the result table — source col is
    still what we gather from, alias is what we publish under.

  * Nullable output columns.  The materialise loop now calls
    ray_vec_set_null per gathered row when the source column
    carries RAY_ATTR_HAS_NULLS, so nullable select columns survive
    a fused top-K without losing the null bits.  ray_vec_set_null
    lazily allocates dst's nullmap on the first set, so we only
    pay the alloc cost when there are actual nulls in the result.

The sort-key nullable gate stays in place — fpk_cmp doesn't yet
implement null ordering (nulls-last / nulls-first), and adding it
needs a planner-level NULLS FIRST/LAST surface.  The TODO in
fpk_cmp documents what's needed to lift that gate too.

CSV empty-SYM policy comment cleaned up — the previous text pointed
to a nonexistent ROADMAP.md; the trade-off is now described inline.

Parity tests cover both lifts: round-trip null in a nullable output
column through top-K, and round-trip an alias name through top-K.

Tests: 2343 / 2345 passing.
Six follow-up issues from the second-round review of
`fused-correctness-review`.  All four blockers fixed, both smells
fixed, plus regression coverage for the two correctness blockers.

Blocker #1 — fallback silently dropped the filter predicate
  exec_filtered_group_fallback was building `ray_const_table → ray_filter`
  then overwriting root with `ray_group(...)`, but ray_group has
  arity 0 and consumes `g->table` directly — the filter node was
  orphaned.  Worse, naively chaining ray_execute(filter) then
  ray_execute(group) doesn't preserve the selection either: the
  outer ray_execute compacts and clears g->selection on return, so
  the group call sees the unfiltered g->table.
  Fix: take the materialised filtered table from the first
  ray_execute, swap it in as g->table for the duration of the group
  call, restore afterwards.

Blocker #2 — count1 path accepted nullable group keys
  Planner gate already rejected this in query.c, but a direct
  ray_filtered_group() C-API call bypassed the planner.  Mirror
  mk_compile's HAS_NULLS check at exec_filtered_group_count1.

Blocker #3 — top-K materialiser corrupted unsupported output types
  ray_fused_topk_select accepted any column type but the materialise
  loop reads via read_by_esz / writes via write_col_i64, which only
  handle scalar BOOL/U8/I16/I32/I64/DATE/TIME/TIMESTAMP/SYM.  RAY_STR
  / GUID / list-shape would gather the raw int64 of the heap-string
  pointer and write a single byte, producing garbage.
  Fix: gate output column types at both the planner gate (query.c)
  and the executor entry (ray_fused_topk_select).  Unsupported types
  fall back to the unfused FILTER + SORT + TAKE.

Blocker #4 — top-K sort-key vector parsing ignored SYM width
  Casting `ray_data(v)` to `int64_t*` is only valid for W64.
  Compact W8/W16/W32 SYM vectors would read garbage.  Use
  ray_read_sym for the width-correct load.

Smell #5 — STR LIKE used the SYM threshold
  Earlier centralisation accidentally swapped the names; STR path
  used LIKE_PAR_MIN_ROWS_SYM and SYM path used LIKE_PAR_MIN_ROWS_STR.
  Restore the correct mapping.

Smell #6 — mixed-temporal test was actually same-typed
  The parity test claimed to cover DATE-vs-TIMESTAMP but used
  DATE-vs-DATE.  Replaced with same-typed DATE compares (which the
  fused gate accepts) and a comment explaining why a real cross-
  temporal compare exercises an unsupported engine path that's
  outside the scope of this branch.

Regression coverage:
  test_fused_group.c
    - fused_group/fallback_filter_honored: builds a fused node with a
      nullable agg input (forces fallback) under a selective WHERE;
      asserts the SUM equals the filtered value, not the unfiltered.
    - fused_group/count1_rejects_nullable_key: builds a fused node
      with a nullable group key; verifies the result matches the
      null-aware unfused group (3 buckets, not 4 with a phantom).
  fused_group_parity.rfl
    - Top-K STR output column lex-order check (forces unfused
      fallback path, asserts correctness).
    - Same-typed DATE compares verify the fused predicate gate
      still fires for the supported shape.

Tests: 2345 / 2347 passing (1 skipped, 1 pre-existing failure).
Bench cluster ratios within noise of the prior state.
The fused top-K comparator resolves SYM IDs through the global
sym_strings snapshot, and the materialiser builds a fresh
ray_sym_vec_new without copying src->sym_dict.  Both assume
global-dict storage.  Columns carrying a per-vector sym_dict
(CSV-loaded SYM cols, splayed slices, IPC-deserialised SYMs)
store LOCAL indices that don't map to the global table — running
them through the fused path can mis-order results and produce
output cols that resolve against the wrong dictionary.

Reject local-dict SYM in both gates so the unfused FILTER + SORT
+ TAKE handles them; that path propagates sym_dict via the
guarded slice-aware pattern in sort.c:3642-3660 / rerank.c:174-188.

- src/ops/fused_topk.c: gate sort-key SYM and output SYM cols on
  dict_owner->sym_dict (slice_parent-aware).
- src/ops/query.c: planner-side mirror for output cols (early
  fallback before DAG build).

Also cleaned up a parity-test gap surfaced in the same review:
the mixed-temporal section's comment claimed DATE-vs-TIMESTAMP
rejection coverage but the assertions were same-typed DATE.
Added an actual DATE-vs-TIMESTAMP probe asserting count 0 (the
unfused-path result; the fused raw-int compare would have
matched all 3 rows).

New regression test exercises the new gate via a CSV-loaded SYM
column inserted in non-alphabetic order — top-K asc must return
alphabetic order, which only the unfused path delivers.
…lish)

Round-5 review's only remaining note: the mixed-temporal regression
test asserted count == 0 with a comment claiming raw fused would
match all 3 rows, but raw int32-days >= int64-ns is uniformly false
for any plausible TS literal — so 0 also matches the raw answer
and the test couldn't catch a regression that removes the
cross-temporal gate.

I tried flipping the column class (TIMESTAMP col vs DATE const) so
raw fused would match all rows while a normalised compare would
match a subset — but rayforce's unfused engine also compares
temporals by raw stored ints today, so both paths produce the same
count.  No (op, lhs-class, rhs-class, value) tuple in current
engine semantics distinguishes raw-fused from generic for cross-
temporal compares.

Updated the test to:
  - Use the TIMESTAMP-col vs DATE-const direction so the comment
    is at least concrete about which raw answer is being pinned.
  - Document the limitation honestly: this test verifies the gate
    routes cross-temporal to the unfused path without crashing or
    producing corrupted output, NOT that the result differs from
    raw fused (because the engine has no proper temporal
    normalisation today — that's tracked separately).

Tests still 2345 / 2347 pass.
The round-1 review fix gated fused top-K off whenever any sort key
column carried RAY_ATTR_HAS_NULLS, because fpk_cmp read the raw
payload without a null-policy leg.  This commit adds that leg and
lifts the planner gate, matching sort.c's default policy:

  ASC  → NULLS LAST   (null is "worse", evicted first in the
                       max-heap of K-best candidates)
  DESC → NULLS FIRST  (null is "better", retained first)

Implementation:

  - fpk_keyspec_t carries a per-key has_nulls flag (set at compile
    time from col->attrs) and a back-pointer to the column for
    ray_vec_is_null.  Runtime check is gated by has_nulls so the
    common no-null path is byte-equivalent to before.

  - fpk_cmp short-circuits the null leg before reading the raw
    payload — both-null falls through to the next key (or the
    source-row tie break), single-null returns immediately with
    the policy-driven sign.

  - query.c sort-key gate now only verifies the column exists.
    Nullable sort keys are accepted for fused top-K.

Coverage in fused_group_parity.rfl: ASC top-5 with two nulls
verifies they sort last with stable source-order tie break; ASC
top-3 verifies nulls are excluded from a smaller K when LAST
policy is in effect; DESC top-3 verifies the inverse FIRST policy.

Future: caller-specified NULLS FIRST / NULLS LAST.  The default
policy is hard-wired here; an explicit clause in the query DSL
would thread through fpk_keyspec_t as a third orientation flag.

Tests: 2345 / 2347 passing.
@singaraiona singaraiona closed this May 7, 2026
singaraiona added a commit that referenced this pull request May 7, 2026
  * macOS clang -Werror,-Wc23-extensions
    src/ops/fused_group.c:2365 had `materialize:` immediately
    followed by an array declaration.  C23 allows that but pre-C23
    requires a statement after the label, and clang on macOS rejects
    the extension under -Werror.  Insert an empty statement so all
    four matrix variants (ubuntu/macos × debug/release) build clean.

  * sym/like_fn/sym_vec_null_sym test failure
    The fast-path LUT projection in ray_like_fn returned 0 for any
    out-of-range sym ID.  The OOM fallback path right below it (and
    the atom branch above) treats unknown / null-resolving sids as
    empty strings — the test's commented expectation "NULL→\"\"
    also matches *" is correct semantics, the fast path was just
    inconsistent.  Compute the empty-string match once (use_simple
    or full glob), cache as empty_match, and let the row pass return
    that for out-of-range sids.  Now 2346 / 2347 pass with 0
    failures (1 skipped).

Closes the matrix red on the branch so CI is green when reopened.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant