refactor: switch transport from Parquet to Arrow IPC#549

Open
paddymul wants to merge 12 commits into main from refactor/arrow-ipc-transport

Conversation

@paddymul (Collaborator) commented Feb 23, 2026

Summary

Replace fastparquet (Python) + hyparquet (JS) with pyarrow.ipc + @uwdata/flechette for binary data transport between Python and JS.

  • Drop fastparquet from core dependencies — key enabler for making pandas optional and for WASM/Pyodide compatibility
  • Synchronous JS decoding with tableFromIPC() eliminates the async/sync callback duality from hyparquet
  • Backend-agnostic serialization: works with both pandas (pa.Table.from_pandas()) and polars (df.to_arrow())
  • Backward compatibility aliases preserved (to_parquet = to_arrow_ipc, etc.)

Performance and bundle size analysis

JS bundle size impact

| Bundle | main | This PR | Delta |
| --- | --- | --- | --- |
| widget.js (Jupyter) | 3,514,857 B | 3,519,203 B | +4,346 B (+0.12%) |

The switch from hyparquet (~10 KB min+gz) to flechette (~14 KB min+gz) adds roughly +4 KB to the final widget bundle — negligible.

Python dependency impact

| Dependency | Install size | Status |
| --- | --- | --- |
| fastparquet | 4.1 MB | Removed from core deps |
| cramjam (fastparquet transitive) | 7.8 MB | No longer pulled in |
| pyarrow | 113 MB | Already a hard dep (unchanged) |

Net savings: ~12 MB removed from install footprint. More importantly, cramjam is a compiled Rust extension with no WASM wheel, so removing it unblocks Pyodide/WASM deployment.

Decoding architecture: why Arrow IPC is better for widget transport

Arrow IPC streaming format is fundamentally different from Parquet for this use case:

| | Arrow IPC | Parquet |
| --- | --- | --- |
| Decode model | Buffer reinterpretation (near-zero cost) | Decompress → reverse encoding → assemble row groups |
| JS API | `tableFromIPC(buf)`, synchronous | `parquetRead({file, onComplete})`, async only |
| Python encode cost | `ipc.new_stream()`, trivial and Arrow-native | `fastparquet.write()`, compression + dictionary/RLE encoding |
| Wire size | Larger (uncompressed by default) | Smaller (compressed) |
| Best for | In-process IPC, widget transport | Remote file access, HTTP range requests |

The wire size tradeoff doesn't matter here — data travels over a local Jupyter websocket or kernel comm channel, not the internet. The decode speed and API simplicity advantages dominate.

Concrete wins from synchronous decoding

The old hyparquet code path had a subtle bug surface: parquetRead fires its onComplete callback asynchronously in some bundler environments (esbuild standalone) but synchronously in others (webpack/Jupyter). That forced us to maintain both resolveDFData() (sync, returning [] if the callback hadn't fired yet) and resolveDFDataAsync() (async, wrapping the result in a Promise), plus a preResolveDFDataDict() pre-resolution step.

With flechette, tableFromIPC() is always synchronous. resolveDFDataAsync now trivially wraps the sync version, and preResolveDFDataDict no longer needs Promise.all. This eliminates an entire class of timing bugs where summary stats would briefly render as empty arrays before the async decode completed.

Row extraction performance

From flechette's published benchmarks (vs the official apache-arrow JS library):

| Operation | Flechette speedup |
| --- | --- |
| Row object extraction (`table.toArray()`) | 7–11x faster |
| Array extraction | 2–7x faster |
| Value iteration | 1.3–1.6x faster |

We use table.toArray() (row object extraction) on every infinite scroll response and every summary stats decode — this is the hot path. The 7–11x speedup over apache-arrow JS is comparing Arrow libraries; the advantage over hyparquet's Parquet decode path is even larger since Parquet decoding has additional decompression overhead on top of extraction.


Files changed

Python (8 files)

  • serialization_utils.py: to_arrow_ipc() and sd_to_ipc_b64() replace fastparquet-based functions
  • All callers updated: buckaroo_widget, polars_buckaroo, lazy_infinite_polars_widget, geopandas_buckaroo, dataflow/dataflow, dataflow/column_executor_dataflow, server/data_loading

JS/TS (7 files)

  • resolveDFData.ts: synchronous tableFromIPC() replaces async parquetRead
  • BuckarooWidgetInfinite.tsx: flechette for infinite scroll buffer parsing
  • DFWhole.ts: added IpcB64Payload type to union
  • index.ts, widget.tsx, standalone.tsx: updated imports and transcript recording

Config

  • pyproject.toml: removed fastparquet from core dependencies
  • package.json: replaced hyparquet with @uwdata/flechette

Tests (4 files)

  • Updated to use to_arrow_ipc, _read_ipc_to_polars / _read_ipc_buffer helpers
  • Updated _resolve_all_stats to handle ipc_b64 format

Test plan

  • uv run pytest tests/unit/serialization_utils_test.py -v — serialization round-trip
  • uv run pytest tests/unit -v — 514 tests pass
  • cd packages/buckaroo-js-core && pnpm test — JS unit tests
  • Full build with ./scripts/full_build.sh
  • Manual: Jupyter notebook with infinite scroll widget

🤖 Generated with Claude Code

@github-actions bot commented Feb 23, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.9.dev22340155296

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.9.dev22340155296

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.12.9.dev22340155296" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

paddymul and others added 2 commits February 23, 2026 22:54
Replace fastparquet + hyparquet with pyarrow.ipc + @uwdata/flechette for
binary data transport between Python and JS.

Python:
- serialization_utils.py: to_arrow_ipc() and sd_to_ipc_b64() replace
  fastparquet-based functions, using pyarrow.ipc streaming format
- All callers updated (buckaroo_widget, polars_buckaroo, lazy_infinite,
  geopandas, dataflow, column_executor_dataflow, server/data_loading)
- Backward compat aliases: to_parquet = to_arrow_ipc, etc.

JS:
- resolveDFData.ts: synchronous tableFromIPC() replaces async hyparquet
- BuckarooWidgetInfinite.tsx: flechette for infinite scroll buffer parsing
- widget.tsx, standalone.tsx, index.ts: updated imports and transcript recording

Config:
- pyproject.toml: removed fastparquet from core dependencies
- package.json: replaced hyparquet with @uwdata/flechette

Benefits:
- Drops fastparquet dep (key enabler for making pandas optional)
- Synchronous JS decoding (eliminates async/sync callback duality)
- Backend-agnostic (works with both pandas and polars via pyarrow)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Guard buckaroo_widget/widget_utils/widget_extension_utils imports in
  __init__.py with try/except so `import buckaroo` works without pandas
- Lazy-import pandas in server/data_loading.py so the server module can
  be imported in [mcp]-only environments (polars + pyarrow, no pandas)
- Use TYPE_CHECKING guard in df_util.py to avoid top-level pandas import
- Remove unused sd_to_ipc_b64 import from serialization_utils_test.py
- Make session.py use Optional[Any] instead of Optional[pd.DataFrame]

These changes enable the Server Playwright Tests to pass since the PR
removes fastparquet (which transitively provided pandas) from core deps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- smoke_test.py: base test no longer requires pandas since fastparquet
  was removed from core deps; pandas-specific checks run only if
  pandas is available
- buckaroo_mcp_tool.py: use mode="lazy" (polars) instead of
  mode="buckaroo" (pandas) since [mcp] extras don't include pandas

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add pandas to [polars] extras since PolarsBuckarooWidget inherits
  from BuckarooWidget which requires pandas
- Add pandas to server Playwright test venv (mode='buckaroo' needs it)
- Fix JupyterLab Playwright test: use .first() for cell locators since
  synchronous IPC decoding now renders summary stats pinned rows
  immediately (where most_freq can match data values like 'Alice')

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
marimo_utils.py imports pandas at top level and BuckarooDataFrame
inherits from pd.DataFrame, so pandas is a real dependency for the
[marimo] extras group.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous runs had Server Playwright cancelled due to concurrency group.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
writeTempParquet() was using pandas via `uv run python`, but in CI
the server Playwright tests run in a clean [mcp] venv where `uv run`
doesn't resolve to a python with pandas. Switch to polars (which is
in the [mcp] extras) and use BUCKAROO_SERVER_PYTHON when available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The committed standalone.js was never rebuilt after switching from
hyparquet to flechette. In CI, `python -m buckaroo.server` runs from
the repo root, so the local buckaroo/ directory shadows the installed
wheel — causing the server to serve the stale hyparquet-based bundle.
Buckaroo-mode tests failed because the JS tried to decode Arrow IPC
binary frames with the old parquet decoder.

Also adds diagnostic logging to test_playwright_server.sh to verify
the installed standalone.js content in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a diagnostic Playwright test that captures:
- Server diagnostics (static_path, dependencies)
- Browser console logs (errors, warnings)
- Page content at 2s intervals (root text, AG Grid element counts)

Also fixes test_playwright_server.sh diagnostic to use importlib
instead of `import buckaroo` (which prints to stdout).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds direct WebSocket capture in the diagnostic test to log:
- mode, df_meta, df_display_args keys and data_key values
- df_data_dict structure, buckaroo_state

This will reveal if the server sends data_key='empty' instead
of 'main', which would cause AG Grid to use clientSide mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pd.api.types.is_numeric was removed in pandas 3.0. The get_mode()
function called it without arguments; fix to is_numeric_dtype(ser).
This was causing the entire analysis pipeline to fail silently in CI
(clean venv with pandas 3.0.1), leaving df_display_args with
data_key='empty' and preventing buckaroo mode from rendering data.

Also removes the temporary diagnostic test added for debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
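The shape of that pandas 3.0 fix, wrapped in a hypothetical helper for illustration (`get_mode()` itself is not reproduced here):

```python
import pandas as pd


def series_is_numeric(ser: pd.Series) -> bool:
    # Before: pd.api.types.is_numeric() — removed in pandas 3.0, and the
    # old call site passed no arguments at all.
    # After: the long-standing dtype predicate, applied to the series.
    return pd.api.types.is_numeric_dtype(ser)
```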
@paddymul (Collaborator, Author) commented:

@codex

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 432029d7af



  prompt = f"Viewing {os.path.basename(path)}"
- payload = json.dumps({"session": SESSION_ID, "path": path, "mode": "buckaroo", "prompt": prompt}).encode()
+ payload = json.dumps({"session": SESSION_ID, "path": path, "mode": "lazy", "prompt": prompt}).encode()


P1: Preserve JSON file support in MCP view mode

Switching the MCP load payload to "mode": "lazy" routes .json files through load_file_lazy, which uses pl.scan_ndjson for JSON inputs; that parser expects newline-delimited JSON and will reject common JSON array/object files that previously loaded via pandas in buckaroo mode. Since view_data/buckaroo_table are documented for JSON files, this change causes real user-visible failures for standard JSON datasets.


from .widget_utils import is_in_ipython, is_in_marimo, enable, disable, determine_jupter_env
from .dataflow.widget_extension_utils import DFViewer
_HAS_PANDAS = True
except ImportError:


P2: Restrict import fallback to missing pandas only

Catching a blanket ImportError here suppresses all widget-stack import failures, not just the intended “pandas not installed” case. If any transitive import regression (or missing core dependency) raises ImportError, buckaroo will silently set _HAS_PANDAS=False and disable notebook exports/initialization instead of surfacing the real error, which makes breakages much harder to detect and debug.

