refactor: switch transport from Parquet to Arrow IPC#549

Open
paddymul wants to merge 12 commits into main from refactor/arrow-ipc-transport

Conversation

@paddymul (Collaborator) commented Feb 23, 2026

Summary

Replace fastparquet (Python) + hyparquet (JS) with pyarrow.ipc + @uwdata/flechette for binary data transport between Python and JS.

  • Drop fastparquet from core dependencies — key enabler for making pandas optional and for WASM/Pyodide compatibility
  • Synchronous JS decoding with tableFromIPC() eliminates the async/sync callback duality from hyparquet
  • Backend-agnostic serialization: works with both pandas (pa.Table.from_pandas()) and polars (df.to_arrow())
  • Backward compatibility aliases preserved (to_parquet = to_arrow_ipc, etc.)

Performance and bundle size analysis

JS bundle size impact

| Bundle | main | This PR | Delta |
| --- | --- | --- | --- |
| widget.js (Jupyter) | 3,514,857 B | 3,519,203 B | +4,346 B (+0.12%) |

The switch from hyparquet (~10 KB min+gz) to flechette (~14 KB min+gz) adds roughly +4 KB to the final widget bundle — negligible.

Python dependency impact

| Dependency | Install size | Status |
| --- | --- | --- |
| fastparquet | 4.1 MB | Removed from core deps |
| cramjam (fastparquet transitive) | 7.8 MB | No longer pulled in |
| pyarrow | 113 MB | Already a hard dep (unchanged) |

Net savings: ~12 MB removed from install footprint. More importantly, cramjam is a compiled Rust extension with no WASM wheel, so removing it unblocks Pyodide/WASM deployment.

Decoding architecture: why Arrow IPC is better for widget transport

Arrow IPC streaming format is fundamentally different from Parquet for this use case:

| | Arrow IPC | Parquet |
| --- | --- | --- |
| Decode model | Buffer reinterpretation (near-zero cost) | Decompress → reverse encoding → assemble row groups |
| JS API | `tableFromIPC(buf)`, synchronous | `parquetRead({file, onComplete})`, async only |
| Python encode cost | `ipc.new_stream()`, trivial and Arrow-native | `fastparquet.write()`, compression + dictionary/RLE encoding |
| Wire size | Larger (uncompressed by default) | Smaller (compressed) |
| Best for | In-process IPC, widget transport | Remote file access, HTTP range requests |

The wire size tradeoff doesn't matter here — data travels over a local Jupyter websocket or kernel comm channel, not the internet. The decode speed and API simplicity advantages dominate.

Concrete wins from synchronous decoding

The old hyparquet code path had a subtle bug surface: parquetRead fires its onComplete callback asynchronously in some bundler environments (esbuild standalone) but synchronously in others (webpack/Jupyter). That forced us to maintain both resolveDFData() (sync, returning [] if the callback hadn't fired yet) and resolveDFDataAsync() (async, wrapping the result in a Promise), plus a preResolveDFDataDict() pre-resolution step.

With flechette, tableFromIPC() is always synchronous. resolveDFDataAsync now trivially wraps the sync version, and preResolveDFDataDict no longer needs Promise.all. This eliminates an entire class of timing bugs where summary stats would briefly render as empty arrays before the async decode completed.

Row extraction performance

From flechette's published benchmarks (vs the official apache-arrow JS library):

| Operation | Flechette speedup |
| --- | --- |
| Row object extraction (`table.toArray()`) | 7–11x faster |
| Array extraction | 2–7x faster |
| Value iteration | 1.3–1.6x faster |

We use table.toArray() (row object extraction) on every infinite scroll response and every summary stats decode — this is the hot path. The 7–11x speedup over apache-arrow JS is comparing Arrow libraries; the advantage over hyparquet's Parquet decode path is even larger since Parquet decoding has additional decompression overhead on top of extraction.


Files changed

Python (8 files)

  • serialization_utils.py: to_arrow_ipc() and sd_to_ipc_b64() replace fastparquet-based functions
  • All callers updated: buckaroo_widget, polars_buckaroo, lazy_infinite_polars_widget, geopandas_buckaroo, dataflow/dataflow, dataflow/column_executor_dataflow, server/data_loading

JS/TS (7 files)

  • resolveDFData.ts: synchronous tableFromIPC() replaces async parquetRead
  • BuckarooWidgetInfinite.tsx: flechette for infinite scroll buffer parsing
  • DFWhole.ts: added IpcB64Payload type to union
  • index.ts, widget.tsx, standalone.tsx: updated imports and transcript recording

Config

  • pyproject.toml: removed fastparquet from core dependencies
  • package.json: replaced hyparquet with @uwdata/flechette

Tests (4 files)

  • Updated to use to_arrow_ipc, _read_ipc_to_polars / _read_ipc_buffer helpers
  • Updated _resolve_all_stats to handle ipc_b64 format

Test plan

  • uv run pytest tests/unit/serialization_utils_test.py -v — serialization round-trip
  • uv run pytest tests/unit -v — 514 tests pass
  • cd packages/buckaroo-js-core && pnpm test — JS unit tests
  • Full build with ./scripts/full_build.sh
  • Manual: Jupyter notebook with infinite scroll widget

🤖 Generated with Claude Code

@github-actions bot commented Feb 23, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.9.dev22340155296

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.9.dev22340155296

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.12.9.dev22340155296" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

paddymul and others added 2 commits February 23, 2026 22:54
Replace fastparquet + hyparquet with pyarrow.ipc + @uwdata/flechette for
binary data transport between Python and JS.

Python:
- serialization_utils.py: to_arrow_ipc() and sd_to_ipc_b64() replace
  fastparquet-based functions, using pyarrow.ipc streaming format
- All callers updated (buckaroo_widget, polars_buckaroo, lazy_infinite,
  geopandas, dataflow, column_executor_dataflow, server/data_loading)
- Backward compat aliases: to_parquet = to_arrow_ipc, etc.

JS:
- resolveDFData.ts: synchronous tableFromIPC() replaces async hyparquet
- BuckarooWidgetInfinite.tsx: flechette for infinite scroll buffer parsing
- widget.tsx, standalone.tsx, index.ts: updated imports and transcript recording

Config:
- pyproject.toml: removed fastparquet from core dependencies
- package.json: replaced hyparquet with @uwdata/flechette

Benefits:
- Drops fastparquet dep (key enabler for making pandas optional)
- Synchronous JS decoding (eliminates async/sync callback duality)
- Backend-agnostic (works with both pandas and polars via pyarrow)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Guard buckaroo_widget/widget_utils/widget_extension_utils imports in
  __init__.py with try/except so `import buckaroo` works without pandas
- Lazy-import pandas in server/data_loading.py so the server module can
  be imported in [mcp]-only environments (polars + pyarrow, no pandas)
- Use TYPE_CHECKING guard in df_util.py to avoid top-level pandas import
- Remove unused sd_to_ipc_b64 import from serialization_utils_test.py
- Make session.py use Optional[Any] instead of Optional[pd.DataFrame]

These changes enable the Server Playwright Tests to pass since the PR
removes fastparquet (which transitively provided pandas) from core deps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- smoke_test.py: base test no longer requires pandas since fastparquet
  was removed from core deps; pandas-specific checks run only if
  pandas is available
- buckaroo_mcp_tool.py: use mode="lazy" (polars) instead of
  mode="buckaroo" (pandas) since [mcp] extras don't include pandas

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add pandas to [polars] extras since PolarsBuckarooWidget inherits
  from BuckarooWidget which requires pandas
- Add pandas to server Playwright test venv (mode='buckaroo' needs it)
- Fix JupyterLab Playwright test: use .first() for cell locators since
  synchronous IPC decoding now renders summary stats pinned rows
  immediately (where most_freq can match data values like 'Alice')

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
marimo_utils.py imports pandas at top level and BuckarooDataFrame
inherits from pd.DataFrame, so pandas is a real dependency for the
[marimo] extras group.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous runs had Server Playwright cancelled due to concurrency group.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
writeTempParquet() was using pandas via `uv run python`, but in CI
the server Playwright tests run in a clean [mcp] venv where `uv run`
doesn't resolve to a python with pandas. Switch to polars (which is
in the [mcp] extras) and use BUCKAROO_SERVER_PYTHON when available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The committed standalone.js was never rebuilt after switching from
hyparquet to flechette. In CI, `python -m buckaroo.server` runs from
the repo root, so the local buckaroo/ directory shadows the installed
wheel — causing the server to serve the stale hyparquet-based bundle.
Buckaroo-mode tests failed because the JS tried to decode Arrow IPC
binary frames with the old parquet decoder.

Also adds diagnostic logging to test_playwright_server.sh to verify
the installed standalone.js content in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a diagnostic Playwright test that captures:
- Server diagnostics (static_path, dependencies)
- Browser console logs (errors, warnings)
- Page content at 2s intervals (root text, AG Grid element counts)

Also fixes test_playwright_server.sh diagnostic to use importlib
instead of `import buckaroo` (which prints to stdout).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds direct WebSocket capture in the diagnostic test to log:
- mode, df_meta, df_display_args keys and data_key values
- df_data_dict structure, buckaroo_state

This will reveal if the server sends data_key='empty' instead
of 'main', which would cause AG Grid to use clientSide mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pd.api.types.is_numeric was removed in pandas 3.0. The get_mode()
function called it without arguments; fix to is_numeric_dtype(ser).
This was causing the entire analysis pipeline to fail silently in CI
(clean venv with pandas 3.0.1), leaving df_display_args with
data_key='empty' and preventing buckaroo mode from rendering data.

Also removes the temporary diagnostic test added for debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
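The shape of that pandas 3.0 fix, wrapped in a hypothetical helper for illustration (`get_mode()` itself is not reproduced here):

```python
import pandas as pd


def series_is_numeric(ser: pd.Series) -> bool:
    # Before: pd.api.types.is_numeric() — removed in pandas 3.0, and the
    # old call site passed no arguments at all.
    # After: the long-standing dtype predicate, applied to the series.
    return pd.api.types.is_numeric_dtype(ser)
```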
@paddymul (Collaborator, Author) commented:

@codex

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 432029d7af



  prompt = f"Viewing {os.path.basename(path)}"
- payload = json.dumps({"session": SESSION_ID, "path": path, "mode": "buckaroo", "prompt": prompt}).encode()
+ payload = json.dumps({"session": SESSION_ID, "path": path, "mode": "lazy", "prompt": prompt}).encode()


P1: Preserve JSON file support in MCP view mode

Switching the MCP load payload to "mode": "lazy" routes .json files through load_file_lazy, which uses pl.scan_ndjson for JSON inputs; that parser expects newline-delimited JSON and will reject common JSON array/object files that previously loaded via pandas in buckaroo mode. Since view_data/buckaroo_table are documented for JSON files, this change causes real user-visible failures for standard JSON datasets.


from .widget_utils import is_in_ipython, is_in_marimo, enable, disable, determine_jupter_env
from .dataflow.widget_extension_utils import DFViewer
_HAS_PANDAS = True
except ImportError:


P2: Restrict import fallback to missing pandas only

Catching a blanket ImportError here suppresses all widget-stack import failures, not just the intended “pandas not installed” case. If any transitive import regression (or missing core dependency) raises ImportError, buckaroo will silently set _HAS_PANDAS=False and disable notebook exports/initialization instead of surfacing the real error, which makes breakages much harder to detect and debug.

