Skip to content

fix: 9 engine bugs surfaced by a weak-assertion coverage audit#226

Merged
singaraiona merged 9 commits into
masterfrom
test/weak-assertion-audit
Jun 5, 2026
Merged

fix: 9 engine bugs surfaced by a weak-assertion coverage audit#226
singaraiona merged 9 commits into
masterfrom
test/weak-assertion-audit

Conversation

@ser-vasilich
Copy link
Copy Markdown
Collaborator

@ser-vasilich ser-vasilich commented Jun 5, 2026

What this PR fixes

Eight real engine bugs, each previously hidden behind type/count-only test
assertions
(a function's result type or row/group count was checked, never
its value — so llvm-cov marked the code "covered" while wrong values passed
silently). Severity-ordered:

# Bug Effect
1 I32 WHERE-branch update strode the source column at a hardcoded 8-byte width memory corruption — rows not selected by WHERE clobbered (update {v:0Ni from:t where:(==k 1i)} over v=[10 20 30][null 30 garbage])
2 negative-zero F64 grouping: equality key stored raw bits while the hash normalized -0.0→+0.0 data loss — pivot value column zero-filled / key columns collapsed into a shadowed "0"
3 typed I32 scalar in a WHERE filter reinterpreted as double select {from:t where:(== k (as 'I32 7))} selected the k=0 group
4 narrow-int (I32/I16/U8) per-group min/max stored at int64 width wrong values for every group past index 0
5 non-by scalar pearson_corr lacked the binary-aggregator guard returned integer 0 instead of the correlation
6 multi-key no-agg by re-evaluated the by SYM-vector as a literal WHERE ignored — grouped the unfiltered columns; also dropped non-key columns
7 sel_compact allocated RAY_LIST (==0) via typed_vec_new non-key LIST column silently dropped under WHERE
8 all-pass filter on a splayed/parted table returned the source verbatim un-materialized segmented column → type error on (at … 'col)
9 count_distinct_approx merged worker HLL shards into shards[0], which a dynamically-scheduled idle worker can leave zero-initialized non-deterministic 0 estimate (intermittent macos-release CI failure)

Commits

How they were found

A weak-assertion audit (the two test: commits) replaced type/count-only
assertions with real value checks across ~92 RFL files. Where a stronger
assertion exposed a real engine bug, the correct value is kept in a
;; xfail BUG: comment with the buggy value pinned live, so the bug stays
visible while the suite stays green — then resolved here, flipping each xfail
back to the correct value as its fix lands.

Verification

  • Full suite green under ASan/UBSan and under a clang coverage build: 0 failed.
  • No src/ change without a value-check that fails before the fix and passes after.
  • Constraints honored: only internal allocators (no libc malloc), no de-static, parted/LIST gathers retain correctly (ASan-clean, no double-free).

Coverage (src only): 89% lines / 85% regions overall; the five touched files 85–96% lines.

🤖 Generated with Claude Code

ser-vasilich and others added 2 commits June 5, 2026 12:27
…(wave 1)

Audit of a catastrophic false-positive-coverage anti-pattern: tests like
`(type (min ['to 'ab 'c])) -- 'sym` or `(count (select …)) -- N` mark a
function as "covered" in llvm-cov while never verifying its actual
result value — so a function returning wrong values would silently pass.

Five parallel agents swept five function categories and added ~290
value-level assertions across 60 RFL files, pairing each surviving
type/count check with a companion value assertion computed by hand from
the input fixture:

  math/agg      min/max/sum/avg/var/stddev/first/last/top/bot/abs/neg/
                sqrt/log/exp/round/floor/ceil/pow  (~70 asserts)
  collection    at/take/distinct/asc/desc/concat/union/except/find/
                rank/group-keys                    (~50 asserts)
  cast/serde    as (de-tautologised ~40 (type (as 'T x))->'T), de/ser
                round-trips, parse/eval, temporal ctors  (~80 asserts)
  query         select/group/pivot/datalog/like/opt result values, not
                just row counts                    (~50 asserts)
  domain        knn/ann row identities, joins (inner/left/anti/asof)
                column values, graph algo results  (~40 asserts)

No src/ bugs were found: every mismatch investigated turned out to be
documented k/q convention (sum widens narrow ints to i64; null is a
distinct least value so (!= 0N 0)=true, (< 0N 0)=true, (== 0N 0)=false).
The fused_group_parity (!= v 0) case is now asserted at its correct
q-convention value (sum 5) with an explanatory comment rather than an
xfail.

Suite: 3232 of 3234 pass (2 skipped, 0 failed) under ASan+UBSan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(wave 2)

Continue the false-positive-coverage audit. 90 RFL files had sites that
asserted only a function's result TYPE or row/group COUNT, never its
VALUE -- so llvm-cov marked the code "covered" while wrong values passed
silently. Each weak site now carries a hand-computed value check beside
(never replacing) the original type/count line.

Per-group results are verified order-independently via key-masked SOURCE
lookup (select {agg by: k from: T where: (== k <key>)}) because group
emission order is non-deterministic. Float results use tolerance.

Real engine bugs surfaced by the stronger checks are pinned LIVE as
xfail (correct value in a ;; xfail BUG: comment, current buggy value
asserted) so the suite stays green while the bug stays visible:
  - I32 update memory corruption (non-selected rows clobbered)
  - I32 typed-scalar WHERE reads atom as 0 -> wrong group (3 repros)
  - parted STR/SYM all-pass filter returns un-materialized column
  - negative-zero F64 pivot zero-fills value column / collapses key cols
  - multi-key no-agg by-group ignores WHERE; drops non-key LIST column
  - non-by scalar pearson_corr returns 0 instead of correlation
  - narrow-int min/max wrong at group index >= 1

Full suite: 3232/3234 pass, 2 skipped, 0 failed (clean ASan/UBSan).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ser-vasilich ser-vasilich marked this pull request as draft June 5, 2026 11:06
@ser-vasilich ser-vasilich marked this pull request as ready for review June 5, 2026 11:46
ser-vasilich and others added 6 commits June 5, 2026 14:50
Two group-by aggregation bugs surfaced by the weak-assertion audit.

1. Narrow-int min/max wrong past group index 0. The emit path stored results
   with a fixed 8-byte int64 write, overshooting a 4/2/1-byte I32/I16/U8
   output column so every group after index 0 landed at the wrong offset
   (correct only at gi==0; single-group masked queries always correct). Store
   at the output element width via topk_write_i64 + ray_sym_elem_size, in both
   the DA and HT emitters. The all-null-group null sentinel is likewise
   widened (agg_int_null_sentinel_for) so it is not truncated to 0.

2. Non-by scalar pearson_corr returned integer 0. The n_keys==0 scalar
   fast-path lacked the binary-aggregator guard the DA path already has, so
   Pearson accumulated nothing and fell to the integer emit default. Skip the
   scalar fast-path when a binary aggregator is present, routing to the HT
   reducer that computes the cross-terms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three query-execution-path bugs surfaced by the weak-assertion audit.

1. I32 typed-atom mishandling (two symptoms, two root causes, both I32-only
   because 4 != 8).
   a. WHERE-branch UPDATE memory corruption. The merge loop read the source
      column at a hardcoded 8-byte stride; for an I32 column (4-byte elems)
      this mis-indexed and read past the end, clobbering the rows NOT selected
      by WHERE -- e.g. update {v:0Ni from:t where:(==k 1i)} over k=[1 2 3]
      v=[10 20 30] yielded v=[null 30 garbage] instead of [null 20 30]. Stride
      by ray_elem_size(ct). (query.c)
   b. Typed I32 scalar in a WHERE filter read as 0. The fused-expr unary CAST
      I32->I64 branch had no I32 source case and fell into the f64 fallback,
      reinterpreting the 4-byte int as a double (small values -> 0), so
      select {from:t where:(== k (as 'I32 7))} selected the k=0 group. Add
      I32/I16/DATE/TIME source arms. (expr.c)

2. Multi-key no-agg by-group ignored WHERE. select {by:[k1 k2] from:t where:...}
   with no aggregate routed to a computed-key fallback that re-evaluated the
   by SYM-vector as a literal symbol list, grouping the UNFILTERED columns
   (one group per key name, not per distinct composite value). Route the
   n_out==0 multi-key case to eval-level grouping (mirroring the existing GUID
   case), which groups the filtered rows by the real composite key. (query.c)
   Follow-up: the eval-group path returns only key columns for this shape; the
   first-of-group non-key columns are carried by the next commit.

3. Non-key LIST column dropped under WHERE. sel_compact allocated result
   columns with typed_vec_new, but RAY_LIST==0 is rejected by ray_vec_new, so
   the column was left NULL and silently dropped (INT columns survived; LIST-
   specific) -- select {by:k from:t where:...} lost a non-key LIST column.
   Allocate LIST via ray_list_new and gather its element pointers with a
   retain, excluding LIST from the non-retaining byte-copy gather paths to
   avoid an alias/double-free. (filter.c)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
An all-pass filter on a splayed/parted table (e.g. where: (like s "*"), or a
select with no where:) hit the no-selection short-circuit and returned the
source table verbatim, leaving its columns in segmented (RAY_IS_PARTED) /
RAY_MAPCOMMON form. A subsequent (at result 'col) then raised a type error,
while a partial filter (non-NULL selection) flattened correctly via
sel_compact -- the defect was specific to the no-selection finalization path.

Flatten residual parted/mapcommon columns at finalization into a freshly
built table (never mutating the possibly-stored source), matching the
partial-filter path. Reuses the OP_SCAN cold-path flatten via a new
flatten_parted_col helper. Flat in-memory tables take the zero-copy
short-circuit unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the multi-key no-agg WHERE fix. After routing
select {by:[k1 k2] from:t} (no aggregate) to the eval-group path, the result
carried only the key columns -- the single-key path returns the first-of-group
value of every non-key column, so the multi-key shape silently dropped them
(e.g. column 'val).

Gather the first-of-group value of each non-key column for the n_out==0
multi-key case, mirroring the single-key first-of-group gather (typed
STR/LIST/scalar copy preserving native type and nulls), rather than
ray_at_fn which wraps a vector index into a LIST. select {by:[k1 k2] from:t}
now returns the key columns plus native-typed first-of-group non-key columns,
matching select {by:k1 from:t}.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CI (macos + ubuntu) surfaced three audit assertions that pinned a specific
emission order which is actually platform-dependent:

- db_mount `(key M)` / `(key Mp)`: a directory-mount dict is keyed by table
  names in filesystem readdir order, which differs across platforms (and from
  the local dev box). Compare the sorted set of names.
- MST edge order: Kruskal's emit order tie-breaks equal-weight edges via an
  unstable weight sort, so the order (and, for a 3-cycle with unit weights,
  the very edge SET) varies by platform. graph_basic / traverse_branch_cov:
  assert sorted edge columns + total weight where the MST set is unique;
  drop the per-edge pins where the set itself is non-unique (keep count +
  total weight).
- shortest-path: several equal-length 0->5 routes exist; which BFS returns
  depends on adjacency iteration order. Assert node count, monotonic per-hop
  depth, and the fixed endpoints, not the specific intermediate nodes.

No engine behavior change; only the test assertions are made robust.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A pivot whose F64 KEY column holds both 0.0 and -0.0 split them into two
groups: the group-by equality key compares raw bits (0x8000.. != 0x0) while
the column name normalizes via clear_neg_zero, so two same-named "0" columns
were emitted (and the value column zero-filled in the lambda fallback).

Normalize the pivot KEY column once, up front, in ray_pivot_fn (a thin wrapper
around the now-static pivot_fn_impl): if the F64 key holds any -0.0, build a
shallow table copy whose pivot column has -0.0 collapsed to +0.0 via
clear_neg_zero, then run the pivot on it. Both the DAG fast path and the
generic lambda fallback then group 0.0/-0.0 together.

The merge must happen BEFORE grouping (so (index,0.0) and (index,-0.0) land in
one group and sum); normalizing after the group-by would overwrite, not merge.

Cold path only: one O(n) bit-check scan of the pivot key (early-exit when no
-0.0), and a table copy solely when -0.0 is present. The group-by per-row hot
loop and its -fno-signed-zeros vectorization are untouched. clear_neg_zero is
an integer bit-check, so 0.0/-0.0 collapse identically across linux/macos and
debug/release.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ser-vasilich ser-vasilich force-pushed the test/weak-assertion-audit branch from be28802 to 33f0b46 Compare June 5, 2026 11:51
@ser-vasilich ser-vasilich marked this pull request as draft June 5, 2026 11:55
…prox

ray_count_distinct_approx allocates one HLL shard per worker via scratch_calloc
(zero-initialized: regs==NULL, m==0) and merges shards[1..nw) into shards[0]
before estimating. Under dynamic worker scheduling a thread — including the one
that owns shards[0] — may process no chunk, leaving shards[0] zero-initialized.
ray_hll_merge(dst=shards[0], src) then hits its precision guard (dst->m 0 !=
src->m) and returns without merging, so every merge is silently skipped and the
estimate falls to 0 (e.g. "RAY_I16 big: got 0, expected ~250").

This is non-deterministic — it only fires when the shards[0]-owning worker
happens to stay idle — which is why it surfaced as an intermittent CI failure
on macos-release.

Merge into the first shard that actually saw data (the first with regs != NULL)
and estimate from that, instead of assuming shards[0] is initialized.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ser-vasilich ser-vasilich marked this pull request as ready for review June 5, 2026 12:13
@ser-vasilich ser-vasilich changed the title fix: 8 engine bugs surfaced by a weak-assertion coverage audit fix: 9 engine bugs surfaced by a weak-assertion coverage audit Jun 5, 2026
@singaraiona singaraiona merged commit 7ba4c7f into master Jun 5, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants