Description
Summary
I see a significant performance regression when many Vortex files are opened at the same time. This pattern occurs when a large table is on the hash-join probe side (the side scanned after the hash table is built) in DataFusion.
Comparing 0b20606 (2026-01-09) to current develop (df23c379), a benchmark with 40 Vortex files × 50K rows shows:
| Benchmark | Old (0b20606) | Develop (df23c379) | Regression |
|---|---|---|---|
| Scan-only (500× sequential) | 795ms | 916ms | +15% |
| Hash-join (500× sequential) | 1,281ms | 1,905ms | +49% |
The scan-only query opens files sequentially (DataFusion pipelined execution). The hash-join query opens all 40 probe-side files concurrently after the build side completes. Hash-join regresses much more than scan-only (+49% vs +15%), which suggests higher per-file open overhead under concurrent access.
Impact
In Spice, we accelerate tables into Vortex files split by row group (~150K rows each). A TPC-H SF1 lineitem table produces ~40 files.
Across 16 TPC-H queries, I measured a +156ms total regression (~31%). The regression is concentrated in queries where lineitem is a hash-join probe-side input (about +10ms to +18.5ms each).
Queries without this pattern are mostly flat (0–4ms), with small improvements in q3 and q1. The key factor appears to be burst concurrent opens of many file splits, not join complexity.
We did not observe a similar slowdown when reading from Parquet or in-memory Arrow record batches, which points away from DataFusion itself as the culprit. (We also upgraded from DataFusion 51 to 52 as part of the Vortex upgrade.)
Reproducing
I created a self-contained benchmark example in vortex-datafusion. It creates 40 Vortex files, registers them as a ListingTable, then measures total time across 500 repeated queries (scan-only and hash-join):
```
cargo run -p vortex-datafusion --release --example concurrent_scan_bench
```

The benchmark is available on these branches in the spiceai/vortex fork:

- `phillip/260305-vortex-slowdown`, based on current `develop` (df23c379)
- `phillip/260305-vortex-slowdown-backport`, based on the old base (0b20606)
The benchmark is a single-file addition (vortex-datafusion/examples/concurrent_scan_bench.rs, 387 lines) with no dependency changes.
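For reference, the measurement loop has roughly the following shape. This is a minimal std-only sketch, not the actual example: `run_query_once` is a hypothetical stand-in for the DataFusion SQL query the real benchmark executes against the registered Vortex `ListingTable`.

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical stand-in for one benchmarked query; the real example runs
// SELECT SUM(value) FROM probe (or the hash join) through DataFusion.
fn run_query_once() {
    black_box((0..1_000u64).sum::<u64>());
}

// One benchmark run: total wall time across `reps` sequential queries, in ms.
fn bench_run(reps: usize) -> u128 {
    let start = Instant::now();
    for _ in 0..reps {
        run_query_once();
    }
    start.elapsed().as_millis()
}

fn main() {
    // Six runs of 500 queries each, as in the reported output.
    let mut runs: Vec<u128> = (0..6).map(|_| bench_run(500)).collect();
    runs.sort_unstable();
    println!("Runs: {:?}", runs);
    let median = (runs[2] + runs[3]) / 2;
    println!("Median: {}ms total ({:.2}ms/query)", median, median as f64 / 500.0);
}
```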
Incorrect analysis
Note: The following analysis is incorrect
Profiling analysis
I profiled the regression using macOS sample on the Spice runtime.
Disclaimer: I had Opus 4.6 do this analysis and have not independently verified it, so take its claims with a grain of salt (especially the % impact claims) - but I felt they were worth including.
1. spawn_blocking in LazyScanStream (~23% of added runtime)
PR #5906 moved ScanBuilder::prepare() into spawn_blocking. This avoids blocking for the single-large-file case (for example, ClickBench-style workloads). But it adds a thread-pool roundtrip per file when many small files are opened concurrently. With 40 files, that means 40 spawn_blocking calls that must be scheduled and completed.
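The per-file roundtrip cost can be illustrated with a std-only analogue: spawning and joining one thread per prepare call instead of going through tokio's blocking pool. `prepare_file` is a hypothetical stand-in for `ScanBuilder::prepare()`; tokio reuses pool threads, but the per-call scheduling handoff is the overhead in question.

```rust
use std::thread;

// Hypothetical stand-in for ScanBuilder::prepare() on one file.
fn prepare_file(n: usize) -> usize {
    n * 2
}

// All files prepared inline on the calling thread (the pre-#5906 shape).
fn prepare_all_inline(files: usize) -> usize {
    (0..files).map(prepare_file).sum()
}

// One scheduling roundtrip per file, emulating a spawn_blocking call per
// open: each prepare pays a handoff to another thread and a join back.
fn prepare_all_roundtrip(files: usize) -> usize {
    (0..files)
        .map(|n| thread::spawn(move || prepare_file(n)).join().unwrap())
        .sum()
}

fn main() {
    // Both paths do the same work; the roundtrip path adds 40 handoffs.
    assert_eq!(prepare_all_inline(40), prepare_all_roundtrip(40));
    println!("inline and roundtrip preparation agree");
}
```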
2. execute() framework overhead (~40% of added runtime)
Related PRs: 5895, 5920, 5922, 5925, 6076, 6307.
These changes replaced the old direct RecordBatch::try_from(array) canonicalization path with a recursive operator execution tree (execute → swap child → re-run). For small arrays from small files, the per-array framework cost is proportionally high. In profiling, this appears as 205 samples in TreeNode::transform_up/down.
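The cost difference can be sketched with a toy operator tree (hypothetical node shapes, not Vortex's actual types): a direct one-pass evaluation versus a generic bottom-up rewrite in the spirit of `TreeNode::transform_up`, which rebuilds every node it visits. For tiny trees from small files, that rebuilding bookkeeping dominates.

```rust
// Toy tree standing in for the recursive array execution tree.
#[derive(Debug, PartialEq)]
enum Node {
    Leaf(i64),
    Add(Box<Node>, Box<Node>),
}

// Direct evaluation: one recursive pass, no intermediate node allocation.
fn eval_direct(n: &Node) -> i64 {
    match n {
        Node::Leaf(v) => *v,
        Node::Add(a, b) => eval_direct(a) + eval_direct(b),
    }
}

// Framework-style bottom-up transform: every node is rebuilt, then handed
// to the rewrite closure, analogous to TreeNode::transform_up.
fn transform_up(n: Node, f: &impl Fn(Node) -> Node) -> Node {
    let rewritten = match n {
        Node::Add(a, b) => Node::Add(
            Box::new(transform_up(*a, f)),
            Box::new(transform_up(*b, f)),
        ),
        leaf => leaf,
    };
    f(rewritten)
}

// The rewrite itself: fold Add over two literal leaves.
fn fold_add(n: Node) -> Node {
    if let Node::Add(a, b) = &n {
        if let (Node::Leaf(x), Node::Leaf(y)) = (a.as_ref(), b.as_ref()) {
            return Node::Leaf(x + y);
        }
    }
    n
}

fn main() {
    let tree = Node::Add(
        Box::new(Node::Add(Box::new(Node::Leaf(1)), Box::new(Node::Leaf(2)))),
        Box::new(Node::Leaf(3)),
    );
    assert_eq!(eval_direct(&tree), 6);
    // Same answer, but every node was reallocated along the way.
    assert_eq!(transform_up(tree, &fold_add), Node::Leaf(6));
    println!("direct eval and transform_up agree");
}
```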
3. DashMap contention in VortexSession (~15% of added runtime)
PR 6000 introduced VortexSession with DashMap<TypeId, Box<dyn SessionVar>> and a Registry backed by Arc<DashMap<Id, T>>. Every array deserialize/execute call goes through these maps. Under 40 concurrent file opens, this creates lock contention:
- 107 samples in `parking_lot::RawMutex::lock_slow`
- 35 samples in `DashMap::lock_exclusive_slow`
- 181 samples in `Condvar::wait`
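The contention pattern can be sketched with a std-only analogue: a shared registry behind a single `Mutex<HashMap>` (DashMap shards its locks, so this exaggerates the effect, but the shape is the same). Every "file open" performs repeated lookups through the shared lock; with 40 threads doing this at once, the lock is the bottleneck. `open_files_concurrently` and its contents are hypothetical illustrations.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Shared registry guarded by one lock; every "file open" reads it repeatedly,
// analogous to the deserialize/execute path hitting the session maps.
fn open_files_concurrently(files: usize, lookups_per_file: usize) -> usize {
    let registry: Arc<Mutex<HashMap<u32, &'static str>>> =
        Arc::new(Mutex::new(HashMap::from([(0, "encoding")])));
    let handles: Vec<_> = (0..files)
        .map(|_| {
            let reg = Arc::clone(&registry);
            thread::spawn(move || {
                let mut hits = 0usize;
                for _ in 0..lookups_per_file {
                    // Each lookup takes the shared lock; 40 threads contend here.
                    if reg.lock().unwrap().contains_key(&0) {
                        hits += 1;
                    }
                }
                hits
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    assert_eq!(open_files_concurrently(40, 100), 4_000);
    println!("40 threads completed 4000 lookups through one shared lock");
}
```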
Benchmark results (full output)
Develop (df23c379):
```
=== Scan-only: 500× SELECT SUM(value) FROM probe ===
Runs: [879, 884, 906, 926, 942, 951]
Median: 916ms total (1.83ms/query)
=== Hash join: 500× probe JOIN build ===
Runs: [1874, 1874, 1902, 1909, 1914, 1926]
Median: 1905ms total (3.81ms/query)
```
Old baseline (0b20606):
```
=== Scan-only: 500× SELECT SUM(value) FROM probe ===
Runs: [788, 791, 794, 796, 813, 815]
Median: 795ms total (1.59ms/query)
=== Hash join: 500× probe JOIN build ===
Runs: [1257, 1266, 1281, 1282, 1293, 1295]
Median: 1281ms total (2.56ms/query)
```
Run-to-run variance is ±2–3% within each version, so the 49% hash-join difference is well outside noise.
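For what it's worth, the reported medians follow from the six run totals, assuming the two middle values are averaged with integer division (`median_ms` is an illustrative helper, not part of the benchmark):

```rust
// Median of an even-length list of run totals (ms): average of the two
// middle values, with integer division as the reported numbers suggest.
fn median_ms(runs: &[u64]) -> u64 {
    let mut sorted = runs.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    (sorted[mid - 1] + sorted[mid]) / 2
}

fn main() {
    assert_eq!(median_ms(&[879, 884, 906, 926, 942, 951]), 916); // develop, scan-only
    assert_eq!(median_ms(&[1874, 1874, 1902, 1909, 1914, 1926]), 1905); // develop, hash-join
    assert_eq!(median_ms(&[788, 791, 794, 796, 813, 815]), 795); // old, scan-only
    assert_eq!(median_ms(&[1257, 1266, 1281, 1282, 1293, 1295]), 1281); // old, hash-join
    println!("all medians match the reported values");
}
```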
Environment
- Apple M4 Max, macOS
- All data on local SSD, warm OS page cache after first run
- Release builds with default optimization