refactor: switch transport from Parquet to Arrow IPC #549
Conversation
📦 TestPyPI package published

```shell
pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.9.dev22340155296
```

or with uv:

```shell
uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.9.dev22340155296
```

MCP server for Claude Code:

```shell
claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.12.9.dev22340155296" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table
```
Replace fastparquet + hyparquet with pyarrow.ipc + @uwdata/flechette for binary data transport between Python and JS.

Python:
- serialization_utils.py: to_arrow_ipc() and sd_to_ipc_b64() replace fastparquet-based functions, using the pyarrow.ipc streaming format
- All callers updated (buckaroo_widget, polars_buckaroo, lazy_infinite, geopandas, dataflow, column_executor_dataflow, server/data_loading)
- Backward compat aliases: to_parquet = to_arrow_ipc, etc.

JS:
- resolveDFData.ts: synchronous tableFromIPC() replaces async hyparquet
- BuckarooWidgetInfinite.tsx: flechette for infinite scroll buffer parsing
- widget.tsx, standalone.tsx, index.ts: updated imports and transcript recording

Config:
- pyproject.toml: removed fastparquet from core dependencies
- package.json: replaced hyparquet with @uwdata/flechette

Benefits:
- Drops the fastparquet dep (key enabler for making pandas optional)
- Synchronous JS decoding (eliminates the async/sync callback duality)
- Backend-agnostic (works with both pandas and polars via pyarrow)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Guard buckaroo_widget/widget_utils/widget_extension_utils imports in __init__.py with try/except so `import buckaroo` works without pandas
- Lazy-import pandas in server/data_loading.py so the server module can be imported in [mcp]-only environments (polars + pyarrow, no pandas)
- Use a TYPE_CHECKING guard in df_util.py to avoid a top-level pandas import
- Remove unused sd_to_ipc_b64 import from serialization_utils_test.py
- Make session.py use Optional[Any] instead of Optional[pd.DataFrame]

These changes enable the Server Playwright Tests to pass, since the PR removes fastparquet (which transitively provided pandas) from core deps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 7cf74a7 to 9ec5e7f
- smoke_test.py: base test no longer requires pandas since fastparquet was removed from core deps; pandas-specific checks run only if pandas is available
- buckaroo_mcp_tool.py: use mode="lazy" (polars) instead of mode="buckaroo" (pandas) since the [mcp] extras don't include pandas

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add pandas to the [polars] extras, since PolarsBuckarooWidget inherits from BuckarooWidget, which requires pandas
- Add pandas to the server Playwright test venv (mode='buckaroo' needs it)
- Fix JupyterLab Playwright test: use .first() for cell locators, since synchronous IPC decoding now renders summary-stats pinned rows immediately (where most_freq can match data values like 'Alice')

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
marimo_utils.py imports pandas at top level and BuckarooDataFrame inherits from pd.DataFrame, so pandas is a real dependency for the [marimo] extras group.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previous runs had Server Playwright cancelled due to the concurrency group.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
writeTempParquet() was using pandas via `uv run python`, but in CI the server Playwright tests run in a clean [mcp] venv where `uv run` doesn't resolve to a python with pandas.

Switch to polars (which is in the [mcp] extras) and use BUCKAROO_SERVER_PYTHON when available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed d014c1c to a3a0406
The committed standalone.js was never rebuilt after switching from hyparquet to flechette. In CI, `python -m buckaroo.server` runs from the repo root, so the local buckaroo/ directory shadows the installed wheel — causing the server to serve the stale hyparquet-based bundle. Buckaroo-mode tests failed because the JS tried to decode Arrow IPC binary frames with the old parquet decoder.

Also adds diagnostic logging to test_playwright_server.sh to verify the installed standalone.js content in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a diagnostic Playwright test that captures:
- Server diagnostics (static_path, dependencies)
- Browser console logs (errors, warnings)
- Page content at 2s intervals (root text, AG Grid element counts)

Also fixes the test_playwright_server.sh diagnostic to use importlib instead of `import buckaroo` (which prints to stdout).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds direct WebSocket capture in the diagnostic test to log:
- mode, df_meta, df_display_args keys and data_key values
- df_data_dict structure, buckaroo_state

This will reveal whether the server sends data_key='empty' instead of 'main', which would cause AG Grid to use clientSide mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pd.api.types.is_numeric was removed in pandas 3.0. The get_mode() function called it without arguments; fix to is_numeric_dtype(ser).

This was causing the entire analysis pipeline to fail silently in CI (clean venv with pandas 3.0.1), leaving df_display_args with data_key='empty' and preventing buckaroo mode from rendering data.

Also removes the temporary diagnostic test added for debugging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
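The fix boils down to using the dtype checker that survives in pandas 3.0 and passing it the series. A minimal illustration of the corrected call, not the actual `get_mode()` code:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# is_numeric_dtype takes the series (or dtype) as an argument; the bug
# was calling a checker that no longer exists, with no arguments at all.
assert is_numeric_dtype(pd.Series([1, 2, 3]))
assert not is_numeric_dtype(pd.Series(["a", "b"]))
```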
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 432029d7af
```diff
 prompt = f"Viewing {os.path.basename(path)}"
-payload = json.dumps({"session": SESSION_ID, "path": path, "mode": "buckaroo", "prompt": prompt}).encode()
+payload = json.dumps({"session": SESSION_ID, "path": path, "mode": "lazy", "prompt": prompt}).encode()
```
Preserve JSON file support in MCP view mode
Switching the MCP load payload to "mode": "lazy" routes .json files through load_file_lazy, which uses pl.scan_ndjson for JSON inputs; that parser expects newline-delimited JSON and will reject common JSON array/object files that previously loaded via pandas in buckaroo mode. Since view_data/buckaroo_table are documented for JSON files, this change causes real user-visible failures for standard JSON datasets.
```python
    from .widget_utils import is_in_ipython, is_in_marimo, enable, disable, determine_jupter_env
    from .dataflow.widget_extension_utils import DFViewer
    _HAS_PANDAS = True
except ImportError:
```
Restrict import fallback to missing pandas only
Catching a blanket ImportError here suppresses all widget-stack import failures, not just the intended “pandas not installed” case. If any transitive import regression (or missing core dependency) raises ImportError, buckaroo will silently set _HAS_PANDAS=False and disable notebook exports/initialization instead of surfacing the real error, which makes breakages much harder to detect and debug.
Summary

Replace `fastparquet` (Python) + `hyparquet` (JS) with `pyarrow.ipc` + `@uwdata/flechette` for binary data transport between Python and JS.

- Drops `fastparquet` from core dependencies — key enabler for making pandas optional and for WASM/Pyodide compatibility
- Synchronous `tableFromIPC()` eliminates the async/sync callback duality from hyparquet
- Backend-agnostic: works with both pandas (`pa.Table.from_pandas()`) and polars (`df.to_arrow()`)
- Backward compat aliases kept (`to_parquet = to_arrow_ipc`, etc.)

Performance and bundle size analysis
JS bundle size impact

For `widget.js` (Jupyter), the switch from hyparquet (~10 KB min+gz) to flechette (~14 KB min+gz) adds roughly +4 KB to the final widget bundle — negligible.
Python dependency impact

- `fastparquet` — removed
- `cramjam` (fastparquet transitive) — removed
- `pyarrow`

Net savings: ~12 MB removed from install footprint. More importantly, `cramjam` is a compiled Rust extension with no WASM wheel, so removing it unblocks Pyodide/WASM deployment.

Decoding architecture: why Arrow IPC is better for widget transport
Arrow IPC streaming format is fundamentally different from Parquet for this use case:

| | Arrow IPC | Parquet |
| --- | --- | --- |
| JS decode | `tableFromIPC(buf)` — synchronous | `parquetRead({file, onComplete})` — async only |
| Python encode | `ipc.new_stream()` — trivial, Arrow-native | `fastparquet.write()` — compression + dictionary/RLE encoding |

The wire size tradeoff doesn't matter here — data travels over a local Jupyter websocket or kernel comm channel, not the internet. The decode speed and API simplicity advantages dominate.
Concrete wins from synchronous decoding

The old hyparquet code path had a subtle bug surface area because `parquetRead` fires its `onComplete` callback asynchronously in some bundler environments (esbuild standalone) but synchronously in others (webpack/Jupyter). This forced us to maintain both `resolveDFData()` (sync, returns `[]` if the callback hasn't fired yet) and `resolveDFDataAsync()` (async, wraps in a Promise), plus a `preResolveDFDataDict()` pre-resolution step.

With flechette, `tableFromIPC()` is always synchronous. `resolveDFDataAsync` now trivially wraps the sync version, and `preResolveDFDataDict` no longer needs `Promise.all`. This eliminates an entire class of timing bugs where summary stats would briefly render as empty arrays before the async decode completed.

Row extraction performance
From flechette's published benchmarks (vs the official `apache-arrow` JS library), row object extraction (`table.toArray()`) is 7–11x faster.

We use `table.toArray()` on every infinite scroll response and every summary stats decode — this is the hot path. The 7–11x speedup over `apache-arrow` JS is comparing Arrow libraries; the advantage over hyparquet's Parquet decode path is even larger, since Parquet decoding has additional decompression overhead on top of extraction.

Files changed
Python (8 files)

- `serialization_utils.py`: `to_arrow_ipc()` and `sd_to_ipc_b64()` replace fastparquet-based functions
- Callers updated: `buckaroo_widget`, `polars_buckaroo`, `lazy_infinite_polars_widget`, `geopandas_buckaroo`, `dataflow/dataflow`, `dataflow/column_executor_dataflow`, `server/data_loading`

JS/TS (7 files)
- `resolveDFData.ts`: synchronous `tableFromIPC()` replaces async `parquetRead`
- `BuckarooWidgetInfinite.tsx`: flechette for infinite scroll buffer parsing
- `DFWhole.ts`: added `IpcB64Payload` type to the union
- `index.ts`, `widget.tsx`, `standalone.tsx`: updated imports and transcript recording

Config

- `pyproject.toml`: removed `fastparquet` from core `dependencies`
- `package.json`: replaced `hyparquet` with `@uwdata/flechette`

Tests (4 files)
- Updated for `to_arrow_ipc` and the `_read_ipc_to_polars`/`_read_ipc_buffer` helpers
- Updated `_resolve_all_stats` to handle the `ipc_b64` format

Test plan
- `uv run pytest tests/unit/serialization_utils_test.py -v` — serialization round-trip
- `uv run pytest tests/unit -v` — 514 tests pass
- `cd packages/buckaroo-js-core && pnpm test` — JS unit tests
- `./scripts/full_build.sh`

🤖 Generated with Claude Code