Description
Summary
I see a significant performance regression when many Vortex files are opened at the same time. This pattern occurs when a large table is on the hash-join probe side (the side scanned after the hash table is built) in DataFusion.
Comparing 0b20606 (2026-01-09) to current develop (df23c379), a benchmark with 40 Vortex files × 50K rows shows:
| Benchmark | Old (0b20606) | Develop (df23c379) | Regression |
|---|---|---|---|
| Scan-only (500× sequential) | 795ms | 916ms | +15% |
| Hash-join (500× sequential) | 1,281ms | 1,905ms | +49% |
The scan-only query opens files sequentially (DataFusion pipelined execution). The hash-join query opens all 40 probe-side files concurrently after the build side completes. Hash-join regresses much more than scan-only (+49% vs +15%), which suggests higher per-file open overhead under concurrent access.
Impact
In Spice, we accelerate tables into Vortex files split by row group (~150K rows each). A TPC-H SF1 lineitem table produces ~40 files.
Across 16 TPC-H queries, I measured a +156ms total regression (~31%). The regression is concentrated in queries where lineitem is a hash-join probe-side input (about +10ms to +18.5ms each).
Queries without this pattern are mostly flat (0–4ms), with small improvements in q3 and q1. The key factor appears to be burst concurrent opens of many file splits, not join complexity.
We did not observe a similar slowdown when reading from Parquet or in-memory Arrow record batches, which points away from DataFusion itself as the culprit. (We also upgraded from DataFusion 51 to 52 as part of the Vortex upgrade.)
Reproducing
I created a self-contained benchmark example in vortex-datafusion. It creates 40 Vortex files, registers them as a ListingTable, then measures total time across 500 repeated queries (scan-only and hash-join):
```
cargo run -p vortex-datafusion --release --example concurrent_scan_bench
```

The benchmark is available on these branches in the spiceai/vortex fork:

- `phillip/260305-vortex-slowdown`, based on current `develop` (df23c379)
- `phillip/260305-vortex-slowdown-backport`, based on the old base (0b20606)
The benchmark is a single-file addition (vortex-datafusion/examples/concurrent_scan_bench.rs, 387 lines) with no dependency changes.
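For reference, the measurement loop has roughly the following shape. This is a minimal std-only sketch, not the actual example: `run_query_once` is a hypothetical stand-in for the DataFusion SQL query the real benchmark executes against the registered Vortex `ListingTable`.

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical stand-in for one benchmarked query; the real example runs
// SELECT SUM(value) FROM probe (or the hash join) through DataFusion.
fn run_query_once() {
    black_box((0..1_000u64).sum::<u64>());
}

// One benchmark run: total wall time across `reps` sequential queries, in ms.
fn bench_run(reps: usize) -> u128 {
    let start = Instant::now();
    for _ in 0..reps {
        run_query_once();
    }
    start.elapsed().as_millis()
}

fn main() {
    // Six runs of 500 queries each, as in the reported output.
    let mut runs: Vec<u128> = (0..6).map(|_| bench_run(500)).collect();
    runs.sort_unstable();
    println!("Runs: {:?}", runs);
    let median = (runs[2] + runs[3]) / 2;
    println!("Median: {}ms total ({:.2}ms/query)", median, median as f64 / 500.0);
}
```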
Incorrect analysis
Note: The following analysis is incorrect
Profiling analysis
I profiled the regression using macOS sample on the Spice runtime.
Disclaimer: I had Opus 4.6 do this analysis and have not independently verified it, so take its claims with a grain of salt (especially the % impact claims) - but I felt they were worth including.
1. spawn_blocking in LazyScanStream (~23% of added runtime)
PR #5906 moved ScanBuilder::prepare() into spawn_blocking. This avoids blocking for the single-large-file case (for example, ClickBench-style workloads). But it adds a thread-pool roundtrip per file when many small files are opened concurrently. With 40 files, that means 40 spawn_blocking calls that must be scheduled and completed.
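The per-file roundtrip cost can be illustrated with a std-only analogue: spawning and joining one thread per prepare call instead of going through tokio's blocking pool. `prepare_file` is a hypothetical stand-in for `ScanBuilder::prepare()`; tokio reuses pool threads, but the per-call scheduling handoff is the overhead in question.

```rust
use std::thread;

// Hypothetical stand-in for ScanBuilder::prepare() on one file.
fn prepare_file(n: usize) -> usize {
    n * 2
}

// All files prepared inline on the calling thread (the pre-#5906 shape).
fn prepare_all_inline(files: usize) -> usize {
    (0..files).map(prepare_file).sum()
}

// One scheduling roundtrip per file, emulating a spawn_blocking call per
// open: each prepare pays a handoff to another thread and a join back.
fn prepare_all_roundtrip(files: usize) -> usize {
    (0..files)
        .map(|n| thread::spawn(move || prepare_file(n)).join().unwrap())
        .sum()
}

fn main() {
    // Both paths do the same work; the roundtrip path adds 40 handoffs.
    assert_eq!(prepare_all_inline(40), prepare_all_roundtrip(40));
    println!("inline and roundtrip preparation agree");
}
```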
2. execute() framework overhead (~40% of added runtime)
Related PRs: 5895, 5920, 5922, 5925, 6076, 6307.
These changes replaced the old direct RecordBatch::try_from(array) canonicalization path with a recursive operator execution tree (execute → swap child → re-run). For small arrays from small files, the per-array framework cost is proportionally high. In profiling, this appears as 205 samples in TreeNode::transform_up/down.
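The cost difference can be sketched with a toy operator tree (hypothetical node shapes, not Vortex's actual types): a direct one-pass evaluation versus a generic bottom-up rewrite in the spirit of `TreeNode::transform_up`, which rebuilds every node it visits. For tiny trees from small files, that rebuilding bookkeeping dominates.

```rust
// Toy tree standing in for the recursive array execution tree.
#[derive(Debug, PartialEq)]
enum Node {
    Leaf(i64),
    Add(Box<Node>, Box<Node>),
}

// Direct evaluation: one recursive pass, no intermediate node allocation.
fn eval_direct(n: &Node) -> i64 {
    match n {
        Node::Leaf(v) => *v,
        Node::Add(a, b) => eval_direct(a) + eval_direct(b),
    }
}

// Framework-style bottom-up transform: every node is rebuilt, then handed
// to the rewrite closure, analogous to TreeNode::transform_up.
fn transform_up(n: Node, f: &impl Fn(Node) -> Node) -> Node {
    let rewritten = match n {
        Node::Add(a, b) => Node::Add(
            Box::new(transform_up(*a, f)),
            Box::new(transform_up(*b, f)),
        ),
        leaf => leaf,
    };
    f(rewritten)
}

// The rewrite itself: fold Add over two literal leaves.
fn fold_add(n: Node) -> Node {
    if let Node::Add(a, b) = &n {
        if let (Node::Leaf(x), Node::Leaf(y)) = (a.as_ref(), b.as_ref()) {
            return Node::Leaf(x + y);
        }
    }
    n
}

fn main() {
    let tree = Node::Add(
        Box::new(Node::Add(Box::new(Node::Leaf(1)), Box::new(Node::Leaf(2)))),
        Box::new(Node::Leaf(3)),
    );
    assert_eq!(eval_direct(&tree), 6);
    // Same answer, but every node was reallocated along the way.
    assert_eq!(transform_up(tree, &fold_add), Node::Leaf(6));
    println!("direct eval and transform_up agree");
}
```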
3. DashMap contention in VortexSession (~15% of added runtime)
PR 6000 introduced VortexSession with DashMap<TypeId, Box<dyn SessionVar>> and a Registry backed by Arc<DashMap<Id, T>>. Every array deserialize/execute call goes through these maps. Under 40 concurrent file opens, this creates lock contention:
- 107 samples in `parking_lot::RawMutex::lock_slow`
- 35 samples in `DashMap::lock_exclusive_slow`
- 181 samples in `Condvar::wait`
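The contention pattern can be sketched with a std-only analogue: a shared registry behind a single `Mutex<HashMap>` (DashMap shards its locks, so this exaggerates the effect, but the shape is the same). Every "file open" performs repeated lookups through the shared lock; with 40 threads doing this at once, the lock is the bottleneck. `open_files_concurrently` and its contents are hypothetical illustrations.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Shared registry guarded by one lock; every "file open" reads it repeatedly,
// analogous to the deserialize/execute path hitting the session maps.
fn open_files_concurrently(files: usize, lookups_per_file: usize) -> usize {
    let registry: Arc<Mutex<HashMap<u32, &'static str>>> =
        Arc::new(Mutex::new(HashMap::from([(0, "encoding")])));
    let handles: Vec<_> = (0..files)
        .map(|_| {
            let reg = Arc::clone(&registry);
            thread::spawn(move || {
                let mut hits = 0usize;
                for _ in 0..lookups_per_file {
                    // Each lookup takes the shared lock; 40 threads contend here.
                    if reg.lock().unwrap().contains_key(&0) {
                        hits += 1;
                    }
                }
                hits
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    assert_eq!(open_files_concurrently(40, 100), 4_000);
    println!("40 threads completed 4000 lookups through one shared lock");
}
```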
Benchmark results (full output)
Develop (df23c379):
```
=== Scan-only: 500× SELECT SUM(value) FROM probe ===
Runs: [879, 884, 906, 926, 942, 951]
Median: 916ms total (1.83ms/query)
=== Hash join: 500× probe JOIN build ===
Runs: [1874, 1874, 1902, 1909, 1914, 1926]
Median: 1905ms total (3.81ms/query)
```
Old baseline (0b20606):
```
=== Scan-only: 500× SELECT SUM(value) FROM probe ===
Runs: [788, 791, 794, 796, 813, 815]
Median: 795ms total (1.59ms/query)
=== Hash join: 500× probe JOIN build ===
Runs: [1257, 1266, 1281, 1282, 1293, 1295]
Median: 1281ms total (2.56ms/query)
```
Run-to-run variance is ±2–3% within each version, so the 49% hash-join difference is well outside noise.
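For what it's worth, the reported medians follow from the six run totals, assuming the two middle values are averaged with integer division (`median_ms` is an illustrative helper, not part of the benchmark):

```rust
// Median of an even-length list of run totals (ms): average of the two
// middle values, with integer division as the reported numbers suggest.
fn median_ms(runs: &[u64]) -> u64 {
    let mut sorted = runs.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    (sorted[mid - 1] + sorted[mid]) / 2
}

fn main() {
    assert_eq!(median_ms(&[879, 884, 906, 926, 942, 951]), 916); // develop, scan-only
    assert_eq!(median_ms(&[1874, 1874, 1902, 1909, 1914, 1926]), 1905); // develop, hash-join
    assert_eq!(median_ms(&[788, 791, 794, 796, 813, 815]), 795); // old, scan-only
    assert_eq!(median_ms(&[1257, 1266, 1281, 1282, 1293, 1295]), 1281); // old, hash-join
    println!("all medians match the reported values");
}
```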
Environment
- Apple M4 Max, macOS
- All data on local SSD, warm OS page cache after first run
- Release builds with default optimization