fix: 9 engine bugs surfaced by a weak-assertion coverage audit by ser-vasilich · Pull Request #226 · RayforceDB/rayforce

ser-vasilich · 2026-06-05T10:27:27Z

What this PR fixes

Eight real engine bugs, each previously hidden behind type/count-only test
assertions (a function's result type or row/group count was checked, never
its value — so llvm-cov marked the code "covered" while wrong values passed
silently). Severity-ordered:

#	Bug	Effect
1	I32 WHERE-branch `update` strode the source column at a hardcoded 8-byte width	memory corruption — rows not selected by WHERE clobbered (`update {v:0Ni from:t where:(==k 1i)}` over `v=[10 20 30]` → `[null 30 garbage]`)
2	negative-zero F64 grouping: equality key stored raw bits while the hash normalized `-0.0→+0.0`	data loss — pivot value column zero-filled / key columns collapsed into a shadowed `"0"`
3	typed I32 scalar in a WHERE filter reinterpreted as `double`	`select {from:t where:(== k (as 'I32 7))}` selected the `k=0` group
4	narrow-int (I32/I16/U8) per-group `min`/`max` stored at int64 width	wrong values for every group past index 0
5	non-by scalar `pearson_corr` lacked the binary-aggregator guard	returned integer `0` instead of the correlation
6	multi-key no-agg `by` re-evaluated the `by` SYM-vector as a literal	WHERE ignored — grouped the unfiltered columns; also dropped non-key columns
7	`sel_compact` allocated `RAY_LIST` (==0) via `typed_vec_new`	non-key LIST column silently dropped under WHERE
8	all-pass filter on a splayed/parted table returned the source verbatim	un-materialized segmented column → `type` error on `(at … 'col)`
9	`count_distinct_approx` merged worker HLL shards into `shards[0]`, which a dynamically-scheduled idle worker can leave zero-initialized	non-deterministic 0 estimate (intermittent macos-release CI failure)

Commits

fix(group) — pivot -0.0 grouping (parse_str() segfaults for certain inputs #2), narrow-int min/max (Byte and byte vector literals not implemented #4), scalar pearson (UB warning when a table is being constructed #5)
fix(query) — I32 update corruption + typed-scalar WHERE (Update README.md #1, Converting a string to a symbol may add random garbage #3), multi-key no-agg WHERE (Segfault in binary_call_atomic #6), LIST drop (Error when parsing char literals #7)
fix(query) — carry non-key first-of-group columns in multi-key no-agg by-group (completes Segfault in binary_call_atomic #6)
fix(exec) — materialize parted columns on all-pass / no-selection filter (Casting an empty list to a symbol vector does not work #8)

How they were found

A weak-assertion audit (the two test: commits) replaced type/count-only
assertions with real value checks across ~92 RFL files. Where a stronger
assertion exposed a real engine bug, the correct value is kept in a
;; xfail BUG: comment with the buggy value pinned live, so the bug stays
visible while the suite stays green — then resolved here, flipping each xfail
back to the correct value as its fix lands.

Verification

Full suite green under ASan/UBSan and under a clang coverage build: 0 failed.
No src/ change without a value-check that fails before the fix and passes after.
Constraints honored: only internal allocators (no libc malloc), no de-static, parted/LIST gathers retain correctly (ASan-clean, no double-free).

Coverage (src only): 89% lines / 85% regions overall; the five touched files 85–96% lines.

🤖 Generated with Claude Code

…(wave 1) Audit of a catastrophic false-positive-coverage anti-pattern: tests like `(type (min ['to 'ab 'c])) -- 'sym` or `(count (select …)) -- N` mark a function as "covered" in llvm-cov while never verifying its actual result value — so a function returning wrong values would silently pass. Five parallel agents swept five function categories and added ~290 value-level assertions across 60 RFL files, pairing each surviving type/count check with a companion value assertion computed by hand from the input fixture: math/agg min/max/sum/avg/var/stddev/first/last/top/bot/abs/neg/ sqrt/log/exp/round/floor/ceil/pow (~70 asserts) collection at/take/distinct/asc/desc/concat/union/except/find/ rank/group-keys (~50 asserts) cast/serde as (de-tautologised ~40 (type (as 'T x))->'T), de/ser round-trips, parse/eval, temporal ctors (~80 asserts) query select/group/pivot/datalog/like/opt result values, not just row counts (~50 asserts) domain knn/ann row identities, joins (inner/left/anti/asof) column values, graph algo results (~40 asserts) No src/ bugs were found: every mismatch investigated turned out to be documented k/q convention (sum widens narrow ints to i64; null is a distinct least value so (!= 0N 0)=true, (< 0N 0)=true, (== 0N 0)=false). The fused_group_parity (!= v 0) case is now asserted at its correct q-convention value (sum 5) with an explanatory comment rather than an xfail. Suite: 3232 of 3234 pass (2 skipped, 0 failed) under ASan+UBSan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…(wave 2) Continue the false-positive-coverage audit. 90 RFL files had sites that asserted only a function's result TYPE or row/group COUNT, never its VALUE -- so llvm-cov marked the code "covered" while wrong values passed silently. Each weak site now carries a hand-computed value check beside (never replacing) the original type/count line. Per-group results are verified order-independently via key-masked SOURCE lookup (select {agg by: k from: T where: (== k <key>)}) because group emission order is non-deterministic. Float results use tolerance. Real engine bugs surfaced by the stronger checks are pinned LIVE as xfail (correct value in a ;; xfail BUG: comment, current buggy value asserted) so the suite stays green while the bug stays visible: - I32 update memory corruption (non-selected rows clobbered) - I32 typed-scalar WHERE reads atom as 0 -> wrong group (3 repros) - parted STR/SYM all-pass filter returns un-materialized column - negative-zero F64 pivot zero-fills value column / collapses key cols - multi-key no-agg by-group ignores WHERE; drops non-key LIST column - non-by scalar pearson_corr returns 0 instead of correlation - narrow-int min/max wrong at group index >= 1 Full suite: 3232/3234 pass, 2 skipped, 0 failed (clean ASan/UBSan). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two group-by aggregation bugs surfaced by the weak-assertion audit. 1. Narrow-int min/max wrong past group index 0. The emit path stored results with a fixed 8-byte int64 write, overshooting a 4/2/1-byte I32/I16/U8 output column so every group after index 0 landed at the wrong offset (correct only at gi==0; single-group masked queries always correct). Store at the output element width via topk_write_i64 + ray_sym_elem_size, in both the DA and HT emitters. The all-null-group null sentinel is likewise widened (agg_int_null_sentinel_for) so it is not truncated to 0. 2. Non-by scalar pearson_corr returned integer 0. The n_keys==0 scalar fast-path lacked the binary-aggregator guard the DA path already has, so Pearson accumulated nothing and fell to the integer emit default. Skip the scalar fast-path when a binary aggregator is present, routing to the HT reducer that computes the cross-terms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Three query-execution-path bugs surfaced by the weak-assertion audit. 1. I32 typed-atom mishandling (two symptoms, two root causes, both I32-only because 4 != 8). a. WHERE-branch UPDATE memory corruption. The merge loop read the source column at a hardcoded 8-byte stride; for an I32 column (4-byte elems) this mis-indexed and read past the end, clobbering the rows NOT selected by WHERE -- e.g. update {v:0Ni from:t where:(==k 1i)} over k=[1 2 3] v=[10 20 30] yielded v=[null 30 garbage] instead of [null 20 30]. Stride by ray_elem_size(ct). (query.c) b. Typed I32 scalar in a WHERE filter read as 0. The fused-expr unary CAST I32->I64 branch had no I32 source case and fell into the f64 fallback, reinterpreting the 4-byte int as a double (small values -> 0), so select {from:t where:(== k (as 'I32 7))} selected the k=0 group. Add I32/I16/DATE/TIME source arms. (expr.c) 2. Multi-key no-agg by-group ignored WHERE. select {by:[k1 k2] from:t where:...} with no aggregate routed to a computed-key fallback that re-evaluated the by SYM-vector as a literal symbol list, grouping the UNFILTERED columns (one group per key name, not per distinct composite value). Route the n_out==0 multi-key case to eval-level grouping (mirroring the existing GUID case), which groups the filtered rows by the real composite key. (query.c) Follow-up: the eval-group path returns only key columns for this shape; the first-of-group non-key columns are carried by the next commit. 3. Non-key LIST column dropped under WHERE. sel_compact allocated result columns with typed_vec_new, but RAY_LIST==0 is rejected by ray_vec_new, so the column was left NULL and silently dropped (INT columns survived; LIST- specific) -- select {by:k from:t where:...} lost a non-key LIST column. Allocate LIST via ray_list_new and gather its element pointers with a retain, excluding LIST from the non-retaining byte-copy gather paths to avoid an alias/double-free. (filter.c) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

An all-pass filter on a splayed/parted table (e.g. where: (like s "*"), or a select with no where:) hit the no-selection short-circuit and returned the source table verbatim, leaving its columns in segmented (RAY_IS_PARTED) / RAY_MAPCOMMON form. A subsequent (at result 'col) then raised a type error, while a partial filter (non-NULL selection) flattened correctly via sel_compact -- the defect was specific to the no-selection finalization path. Flatten residual parted/mapcommon columns at finalization into a freshly built table (never mutating the possibly-stored source), matching the partial-filter path. Reuses the OP_SCAN cold-path flatten via a new flatten_parted_col helper. Flat in-memory tables take the zero-copy short-circuit unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Completes the multi-key no-agg WHERE fix. After routing select {by:[k1 k2] from:t} (no aggregate) to the eval-group path, the result carried only the key columns -- the single-key path returns the first-of-group value of every non-key column, so the multi-key shape silently dropped them (e.g. column 'val). Gather the first-of-group value of each non-key column for the n_out==0 multi-key case, mirroring the single-key first-of-group gather (typed STR/LIST/scalar copy preserving native type and nulls), rather than ray_at_fn which wraps a vector index into a LIST. select {by:[k1 k2] from:t} now returns the key columns plus native-typed first-of-group non-key columns, matching select {by:k1 from:t}. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CI (macos + ubuntu) surfaced three audit assertions that pinned a specific emission order which is actually platform-dependent: - db_mount `(key M)` / `(key Mp)`: a directory-mount dict is keyed by table names in filesystem readdir order, which differs across platforms (and from the local dev box). Compare the sorted set of names. - MST edge order: Kruskal's emit order tie-breaks equal-weight edges via an unstable weight sort, so the order (and, for a 3-cycle with unit weights, the very edge SET) varies by platform. graph_basic / traverse_branch_cov: assert sorted edge columns + total weight where the MST set is unique; drop the per-edge pins where the set itself is non-unique (keep count + total weight). - shortest-path: several equal-length 0->5 routes exist; which BFS returns depends on adjacency iteration order. Assert node count, monotonic per-hop depth, and the fixed endpoints, not the specific intermediate nodes. No engine behavior change; only the test assertions are made robust. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A pivot whose F64 KEY column holds both 0.0 and -0.0 split them into two groups: the group-by equality key compares raw bits (0x8000.. != 0x0) while the column name normalizes via clear_neg_zero, so two same-named "0" columns were emitted (and the value column zero-filled in the lambda fallback). Normalize the pivot KEY column once, up front, in ray_pivot_fn (a thin wrapper around the now-static pivot_fn_impl): if the F64 key holds any -0.0, build a shallow table copy whose pivot column has -0.0 collapsed to +0.0 via clear_neg_zero, then run the pivot on it. Both the DAG fast path and the generic lambda fallback then group 0.0/-0.0 together. The merge must happen BEFORE grouping (so (index,0.0) and (index,-0.0) land in one group and sum); normalizing after the group-by would overwrite, not merge. Cold path only: one O(n) bit-check scan of the pivot key (early-exit when no -0.0), and a table copy solely when -0.0 is present. The group-by per-row hot loop and its -fno-signed-zeros vectorization are untouched. clear_neg_zero is an integer bit-check, so 0.0/-0.0 collapse identically across linux/macos and debug/release. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…prox ray_count_distinct_approx allocates one HLL shard per worker via scratch_calloc (zero-initialized: regs==NULL, m==0) and merges shards[1..nw) into shards[0] before estimating. Under dynamic worker scheduling a thread — including the one that owns shards[0] — may process no chunk, leaving shards[0] zero-initialized. ray_hll_merge(dst=shards[0], src) then hits its precision guard (dst->m 0 != src->m) and returns without merging, so every merge is silently skipped and the estimate falls to 0 (e.g. "RAY_I16 big: got 0, expected ~250"). This is non-deterministic — it only fires when the shards[0]-owning worker happens to stay idle — which is why it surfaced as an intermittent CI failure on macos-release. Merge into the first shard that actually saw data (the first with regs != NULL) and estimate from that, instead of assuming shards[0] is initialized. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ser-vasilich and others added 2 commits June 5, 2026 12:27

ser-vasilich marked this pull request as draft June 5, 2026 11:06

ser-vasilich marked this pull request as ready for review June 5, 2026 11:46

ser-vasilich and others added 6 commits June 5, 2026 14:50

ser-vasilich force-pushed the test/weak-assertion-audit branch from be28802 to 33f0b46 Compare June 5, 2026 11:51

ser-vasilich marked this pull request as draft June 5, 2026 11:55

ser-vasilich marked this pull request as ready for review June 5, 2026 12:13

ser-vasilich changed the title ~~fix: 8 engine bugs surfaced by a weak-assertion coverage audit~~ fix: 9 engine bugs surfaced by a weak-assertion coverage audit Jun 5, 2026

singaraiona merged commit 7ba4c7f into master Jun 5, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: 9 engine bugs surfaced by a weak-assertion coverage audit#226

fix: 9 engine bugs surfaced by a weak-assertion coverage audit#226
singaraiona merged 9 commits into
masterfrom
test/weak-assertion-audit

ser-vasilich commented Jun 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ser-vasilich commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR fixes

Commits

How they were found

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ser-vasilich commented Jun 5, 2026 •

edited

Loading