
feat(xorq): XorqBuckarooWidget — buckaroo against an ibis/xorq expression #703

Merged
paddymul merged 13 commits into main from feat/xorq-buckaroo-widget on May 6, 2026

Conversation


paddymul (Collaborator) commented May 5, 2026

Summary

  • XorqBuckarooWidget(expr) — buckaroo over a xorq/ibis expression. Stats push down through XorqDfStatsV2 (one batched aggregate + per-column histogram queries); only a display-sized sample lands in pandas via expr.limit(N).execute().
  • add_processing(fn) takes an expression-to-expression function so postprocessing also pushes down — filtered stats reflect the filtered expression without re-materialising.
  • BuckarooWidgetBase gains a dataflow_klass class attribute so the inner dataflow can be swapped without duplicating __init__. Pandas widget unchanged.
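The `dataflow_klass` swap point can be sketched in isolation. This is a hypothetical stand-in — class bodies here are toys, not the real buckaroo implementations — but it shows why the subclass needs no duplicated `__init__`:

```python
# Hypothetical sketch of the dataflow_klass pattern: subclasses swap
# the inner dataflow via a class attribute instead of re-implementing
# __init__. Bodies are stand-ins, not real buckaroo code.

class PandasDataflow:
    def __init__(self, df):
        self.df = df

    def row_count(self):
        return len(self.df)          # eager: len() works on a DataFrame

class XorqDataflow(PandasDataflow):
    def row_count(self):
        # lazy: a real ibis expr has no len(); a count() query would be
        # pushed down instead (sketched here on a plain sequence)
        return sum(1 for _ in self.df)

class BuckarooWidgetBase:
    dataflow_klass = PandasDataflow  # the swap point

    def __init__(self, df):
        self.dataflow = self.dataflow_klass(df)

class XorqBuckarooWidget(BuckarooWidgetBase):
    dataflow_klass = XorqDataflow    # only this line changes
```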

What's new in buckaroo/xorq_buckaroo.py

  • XorqSampling — serialize_sample does expr.limit(N).execute(); pre-stats sampling is a no-op (the point is push-down).
  • XorqAutocleaning — pass-through. The lisp interpreter targets pandas DataFrames; for ibis exprs we skip cleaning entirely.
  • XorqDataflow — overrides populate_df_meta to use expr.count().execute() instead of len(), and rewrites the summary-dict keys from original column names (what XorqStatPipeline produces) to the rewritten a, b, c space the frontend works in.
  • XorqBuckarooWidget — wires the above together with XORQ_STATS_V2 + the default styling klasses, and exposes add_processing(expr_func).
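The summary-key rewrite in XorqDataflow can be illustrated with a toy helper (hypothetical — the real mapping lives inside the dataflow; this assumes the rewritten space is simply `a`, `b`, `c`, … in column order):

```python
import string

# Sketch of the summary-dict key rewrite: XorqStatPipeline keys stats
# by original column names, while the frontend works in a rewritten
# positional space ('a', 'b', 'c', ...). rewrite_summary_keys is a
# hypothetical stand-in.

def rewrite_summary_keys(summary_dict, orig_columns):
    """Map {'qty': stats} -> {'a': stats}, following column order."""
    return {
        string.ascii_lowercase[i]: summary_dict[col]
        for i, col in enumerate(orig_columns)
    }
```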

Test plan

  • 9 new unit tests in tests/unit/test_xorq_buckaroo_widget.py cover instantiation, df_meta from count(), column rewrite, push-down filter via add_processing, switching back to no-op processing, and the error-frame fallback when a postprocessor raises
  • full unit suite (931 tests) still passes
  • manually verified: XorqBuckarooWidget(memtable).add_processing(lambda e: e.filter(e.qty > 2)) → 4-row filtered display, stats min/max reflect the filtered expr (9.9 / 50.0 vs 7.4 / 50.0 unfiltered)
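The add_processing semantics checked above can be illustrated with a toy lazy expression (everything below is a stand-in; the real widget composes immutable ibis expressions, not this `LazyExpr` class):

```python
# Toy illustration of add_processing: the hook takes an
# expression-to-expression function, so the filter composes lazily and
# nothing is re-materialised until execute().

class LazyExpr:
    def __init__(self, rows, preds=()):
        self.rows, self.preds = rows, tuple(preds)

    def filter(self, pred):
        # returns a new expression; self is untouched (immutability)
        return LazyExpr(self.rows, self.preds + (pred,))

    def execute(self):
        out = self.rows
        for pred in self.preds:
            out = [r for r in out if pred(r)]
        return out

expr = LazyExpr([{"qty": 1}, {"qty": 3}, {"qty": 5}])
# expression-to-expression function, as add_processing expects
processed = (lambda e: e.filter(lambda r: r["qty"] > 2))(expr)
```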

🤖 Generated with Claude Code


github-actions Bot commented May 5, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.5.dev25438596086

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.5.dev25438596086

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.13.5.dev25438596086" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview


chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 202c5e44e0


paddymul added a commit that referenced this pull request May 6, 2026
Subprocess + meta-path blocker — importing ``buckaroo.xorq_buckaroo``
fails fast if polars (or any polars-only buckaroo module) gets pulled
in transitively, and vice versa for the polars side. Catches the
regression Codex flagged on PR #703 (NoCleaningConfPl import) before it
reaches a ``buckaroo[xorq]`` install with no polars.

Both extras are optional; cross-imports turn one extra into a hard
requirement for the other and break the advertised install paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paddymul added a commit that referenced this pull request May 6, 2026
Three checks were red, all from the last test push:

- JS / Build + Test, Build JS + Python Wheel, ReadTheDocs (all run
  ``tsc``): xorqWindow.dom.test.tsx used ``{ displayer: "float" }``
  which is missing the required ``min_fraction_digits`` /
  ``max_fraction_digits`` from FloatDisplayerA. Add them.

- Python / Lint (ruff F841): the multi-stage step test bound
  ``df = pd.read_parquet(...)`` but never used it. Renamed to
  ``_bufs`` and dropped the call.

Verified locally: ruff clean, tsc -b passes, vite build succeeds,
36 python tests + 2 dom tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paddymul and others added 13 commits May 6, 2026 09:34
…sion

Stats push down via XorqDfStatsV2 (one batched aggregate + per-column
histogram queries); only a display-sized sample lands in pandas. add_processing
takes an expression-to-expression function so postprocessing also pushes down
— filtered stats reflect the filtered expr without re-materialising.

- buckaroo/xorq_buckaroo.py: XorqBuckarooWidget plus XorqSampling (limit+execute),
  XorqAutocleaning (no-op; the lisp interpreter targets pandas), and XorqDataflow
  (count() for df_meta, rewrites SD keys orig→a/b/c since XorqStatPipeline keeps
  orig names but the frontend works in rewritten space).
- buckaroo/buckaroo_widget.py: dataflow_klass class attribute on
  BuckarooWidgetBase so the inner dataflow can be swapped without
  duplicating __init__. Pandas widget unchanged.
- tests/unit/test_xorq_buckaroo_widget.py: instantiation, df_meta from
  count(), column rewrite, push-down filter through add_processing,
  switch-back, error-frame fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
XorqBuckarooInfiniteWidget pushes pagination down to the backend.
Each infinite_request slices via expr.limit(end - start, offset=start).execute();
sort fires expr.order_by(asc/desc).limit(...) on the backend.
df_data_dict.main stays empty — pagination drives loading. The
total length in each response is one expr.count().execute(); the
small-window pandas frame is what crosses the parquet boundary.

Search wired through quick_command_args. XorqSearch carries the
'search' lisp symbol and the frontend quick-args pattern, but
the actual filter dispatch happens directly in XorqAutocleaning —
the lisp interpreter built by configure_buckaroo bakes in
df.copy()/df.clone(), neither of which works on immutable ibis
exprs. _xorq_search builds an OR of contains() across all string
columns; an empty value clears the filter.

NoCleaningConfXorq replaces NoCleaningConfPl in the autocleaning
config so command symbols are registered for the frontend.

Tests: 9 new (3 search behaviours, 6 infinite widget — slice,
sort pushdown, second_request piggyback, search-then-paginate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
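The request path above can be sketched with a toy handler (hypothetical stand-in; the real path composes `expr.order_by(...).limit(end - start, offset=start)` on the backend rather than slicing a list):

```python
# Toy sketch of push-down pagination: sort and slice compose, and only
# rows [start, end) are "materialised" per request. handle_payload is a
# hypothetical stand-in for the infinite widget's request handler.

def handle_payload(rows, start, end, sort=None, ascending=True):
    if sort is not None:
        # mirrors expr.order_by(asc/desc) firing on the backend
        rows = sorted(rows, key=lambda r: r[sort], reverse=not ascending)
    # mirrors expr.limit(end - start, offset=start).execute()
    return rows[start:start + max(end - start, 0)]
```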
Subprocess + meta-path blocker — importing ``buckaroo.xorq_buckaroo``
fails fast if polars (or any polars-only buckaroo module) gets pulled
in transitively, and vice versa for the polars side. Catches the
regression Codex flagged on PR #703 (NoCleaningConfPl import) before it
reaches a ``buckaroo[xorq]`` install with no polars.

Both extras are optional; cross-imports turn one extra into a hard
requirement for the other and break the advertised install paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run paddy-format on the two test files added this branch — collapses
multiline literals with trailing commas to a single line. CI lint job
was failing for these two files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes to address paddymul's note that it wasn't obvious that the
expression stays lazy until a *bounded* execute on a small slice:

- Inline the slicing inside _handle_payload_args. Sort and slice flow
  through one branch, the explicit ``expr.limit(end - start, offset=start)
  .execute()`` lives in _execute_window (renamed from _sliced_pandas)
  with a docstring spelling out that this is the only place the expr
  is materialised, and only over rows [start, end).

- Refuse a bare expr.execute() in XorqSampling.serialize_sample.
  Previously the ``serialize_limit == 0`` else branch would have done
  ``df_or_expr.execute()`` — unreachable in practice (basic widget has
  serialize_limit=5_000; infinite widget skips this path via
  skip_main_serial=True), but exactly the kind of thing the reviewer
  was guarding against. Now an unbounded serialize_limit raises
  RuntimeError instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
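The unbounded-execute guard can be sketched as follows (a stand-in: the slicing here plays the role of `expr.limit(n).execute()`, and the function name merely echoes the method described above):

```python
# Sketch of the guard: a zero / falsy serialize_limit raises instead of
# silently materialising the whole expression.

def serialize_sample(rows, serialize_limit):
    if not serialize_limit:
        raise RuntimeError(
            "refusing unbounded execute(); set a positive serialize_limit"
        )
    return rows[:serialize_limit]   # stand-in for expr.limit(n).execute()
```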
Mirror the polars infinite widget's wire path: arrow → parquet, no
pandas detour. The ibis branch of _window_to_parquet (renamed from
_execute_window) goes:

    expr.select(rename).limit(end - start, offset=start)
        .to_pyarrow()  ->  pa.Table  ->  pq.write_table

instead of `.execute()` (pandas DataFrame) followed by fastparquet.
The pandas branch (postprocessor materialised, error frame) still
flows through fastparquet — there's no expression to push down.

Side benefit of the arrow path: string columns travel as native
parquet UTF8 instead of fastparquet's JSON-encoded strings.

Tests around the infinite behaviour — went from 18 to 35:

- TestInfiniteWindowEdges: partial-at-end window, past-end empty,
  index-column carries absolute offsets across pages.
- TestInfiniteSort: descending, by string column, with non-zero
  offset, second_request skipped when sorted.
- TestInfinitePostprocessing: filter-returning processor stays
  push-down; pandas-returning processor takes the in-process branch.
- TestInfiniteErrorPath: bad sort column emits error_info + length=0.
- TestLazyPostprocessor: postprocessor body runs without triggering
  any materialisation (pandas or arrow); paginates with bounded plan.
- TestInfiniteBoundedExecution: the central PR-review invariant —
  the only ops materialised are CountStar (aggregate) and Limit
  (windowed slice), never InMemoryTable. Includes a no-pandas-
  detour spy that confirms the row window goes through to_pyarrow,
  never .execute().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two separate _handle_payload_args calls — [0, 50) then a jump to
[300, 350) — confirm the infinite widget handles scroll-jumps without
materialising rows outside either window. Each request emits exactly
one CountStar (total) and one Limit (slice); InMemoryTable never
appears in the spy log. This exercises the contract a frontend would
hit when the user scrolls past the loaded buffer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The xorq infinite widget switched to arrow-direct serialization
(expr.to_pyarrow() -> pyarrow.parquet.write_table) instead of
pandas / fastparquet. The wire format changes — most notably string
columns travel as native parquet UTF8 instead of fastparquet's
JSON-encoded form — so we need a contract test that the JS frontend's
hyparquet reader handles the new bytes.

Pattern (mirrors the existing summary_stats_parquet_b64 fixture):

- scripts/generate_xorq_window_fixture.py — emits the fixture by
  running XorqBuckarooInfiniteWidget._handle_payload_args on a small
  deterministic dataset and base64-encoding the parquet response.
- tests/unit/test_xorq_window_fixture.py — re-runs the generator in
  memory and diffs every field (including the parquet bytes) against
  the committed fixture, so drift between Python output and the JS
  fixture surfaces in CI.
- packages/.../xorqWindow.test.ts — uses hyparquet to decode the
  fixture and asserts: row count, schema in the rewritten 'a, b,
  index' space, strings come through native UTF8 (not JSON-quoted),
  and the index column carries absolute window offsets.

Run scripts/generate_xorq_window_fixture.py to refresh the fixture
when the wire format intentionally changes; the drift test will tell
you if you forgot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
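The fixture-drift pattern above reduces to a simple invariant, sketched here with a hypothetical `generate_fixture` (the real generator runs `_handle_payload_args` and base64-encodes the parquet response; the bytes below are placeholders):

```python
import base64

# Drift-test pattern: re-run the fixture generator in memory and diff
# every field against the committed copy, so any wire-format change
# surfaces in CI. generate_fixture is a hypothetical stand-in.

def generate_fixture():
    payload = b"PAR1...parquet bytes...PAR1"   # placeholder bytes
    return {"length": 2, "parquet_b64": base64.b64encode(payload).decode()}

committed_fixture = generate_fixture()          # what lands in the repo
```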
xorqWindow.test.ts proved the parquet bytes are decodable; this test
adds the next layer: the decoded rows actually reach AgGridReact's
rowData with the right values. End-to-end:

    parquet bytes  →  hyparquet decode  →  DFViewerInfinite (Raw mode)
    →  AgGridReact rowData

The test mocks AgGridReact (the same pattern DFViewerInfinite.test.tsx
uses; real AG-Grid uses virtual scrolling + canvas, which jsdom can't
render cleanly) and inspects ``latestAgGridProps.gridOptions.rowData``
— the surface where 'what will be rendered into the grid DOM' is
captured. Asserts:

- row count matches the request window;
- index column carries absolute window offsets;
- name strings come through native UTF8 (no JSON quotes — proves the
  arrow-direct serialization survives the full pipeline);
- price column is numeric (not strings);
- column_defs' ``field`` keys match what's in rowData (otherwise
  AG-Grid would render blank cells).

Sanitises BigInt → Number on decode, mirroring resolveDFData's
parseParquetRow. Hyparquet decodes parquet INT64 as JS BigInt, and
DFViewerInfinite's Raw path JSON-stringifies rowData for a memo
signature; without sanitising, the test crashes on a BigInt that
production never encounters (production uses DataSource mode).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end check that the arrow-direct serialization (Python) lines up
with hyparquet decoding (JS) and lands real values in real AG-Grid
cells. Loads a 5000-row XorqBuckarooInfiniteWidget into JupyterLab,
exercises three scenarios:

1. Initial render — row 0's str_col cell ("foo_10") is visible and
   matches the predictable invariant
   ``str_col == "foo_{row_num + 10}"``.

2. Deep scroll — scrolls ~30000px, waits for the follow-up
   infinite_request to round-trip, then reads every visible body row
   (filtered to ``[row-index]`` so the pinned dtype row doesn't leak
   in) and asserts the same invariant. The min visible row_num must be
   > 500 — proves we genuinely scrolled rather than re-rendering the
   top.

3. Jump back to top — after a deep scroll, scrollTop=0 and verify
   row 0 is rendered again (cache hit or re-fetch, both fine).

Test setup follows the existing infinite-scroll-transcript.spec.ts
pattern: notebook lives in tests/integration_notebooks/, the runner
script copies it into the JupyterLab cwd, and the spec file is
matched by playwright.config.integration.ts. The script's
notebook→spec routing now sends test_xorq_infinite_scroll.ipynb to
xorq-infinite-scroll.spec.ts with a 60s timeout (xorq + DataFusion
startup is heavier than polars).

Verified locally: 3/3 sub-tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three checks were red, all from the last test push:

- JS / Build + Test, Build JS + Python Wheel, ReadTheDocs (all run
  ``tsc``): xorqWindow.dom.test.tsx used ``{ displayer: "float" }``
  which is missing the required ``min_fraction_digits`` /
  ``max_fraction_digits`` from FloatDisplayerA. Add them.

- Python / Lint (ruff F841): the multi-stage step test bound
  ``df = pd.read_parquet(...)`` but never used it. Renamed to
  ``_bufs`` and dropped the call.

Verified locally: ruff clean, tsc -b passes, vite build succeeds,
36 python tests + 2 dom tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks through the regular and infinite-scroll variants against citibike
parquet (100k rows) and a 1.2M-row NumPy-generated frame, plus a lazy
postprocessing demo via expr.filter().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paddymul force-pushed the feat/xorq-buckaroo-widget branch from 16711c9 to 28db376 on May 6, 2026 13:34
paddymul deployed to testpypi May 6, 2026 13:36 via GitHub Actions
paddymul added this pull request to the merge queue May 6, 2026
Merged via the queue into main with commit 4ef5916 on May 6, 2026
27 checks passed