DuckDB MultiFile Reader #7848

Draft
gatesn wants to merge 9 commits into develop from ngates/duckdb-multi

Conversation


gatesn commented May 8, 2026

Putting this up to start the discussion.

This PR exposes DuckDB's multi-file reader machinery to Rust, along with a Vortex implementation.

It is largely slower than our current implementation, for a few reasons:

  • There is more constant overhead on the DuckDB side, e.g. file mutexes.
  • DuckDB has stricter order preservation. We run one file per worker thread, essentially interleaving batches from multiple files, whereas DuckDB returns batches in order of {file_idx, batch_idx} (see the sketch below).
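
To make the ordering difference concrete, here is a minimal Rust sketch (names and values hypothetical, not from this PR) of the key DuckDB effectively preserves across files:

```rust
/// Hypothetical ordering key: batches compare first by source file, then
/// by position within that file, so output is {file_idx, batch_idx}-ordered.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct BatchKey {
    file_idx: u64,
    batch_idx: u64,
}

fn main() {
    // Two worker threads produce batches from two files concurrently,
    // arriving interleaved...
    let mut produced = vec![
        BatchKey { file_idx: 1, batch_idx: 0 },
        BatchKey { file_idx: 0, batch_idx: 1 },
        BatchKey { file_idx: 0, batch_idx: 0 },
        BatchKey { file_idx: 1, batch_idx: 1 },
    ];
    // ...but strict order preservation emits them sorted by the key,
    // which serializes output across files instead of interleaving them.
    produced.sort();
    assert_eq!(produced[0], BatchKey { file_idx: 0, batch_idx: 0 });
}
```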

A couple of queries are faster; this is typically due to per-file statistics.

The reason I was experimenting with this is that DuckLake is internally implemented using the MultiFileReader interface, so this would be the easiest way to add Vortex DuckLake support.

We may want to pull out parts of this PR:

  • Exposing the DuckDB object cache for footer caching; this reports exact stats when footers exist in the cache.
  • A fix that makes the FileSystem hold onto the ClientCtx to avoid a SIGSEGV in concurrent tests.
  • Code to rewrite row_idx() filters into RowSelection masks prior to constructing the layout reader tree. This can short-circuit a lot of overhead for highly selective queries (sketched below).
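
A minimal sketch of the row_idx() rewrite idea, assuming (purely for illustration; the PR's actual code differs) a predicate of the form `row_idx >= lo AND row_idx < hi` and a range-based row-selection mask:

```rust
use std::ops::Range;

/// Hypothetical stand-in for a RowSelection mask: row ranges the layout
/// reader tree can consume directly, skipping per-chunk filter evaluation.
#[derive(Debug, PartialEq)]
struct RowSelection {
    ranges: Vec<Range<u64>>,
}

/// Rewrite `row_idx() >= lo AND row_idx() < hi` into a RowSelection
/// clipped to the file's row count, before any layout readers are built.
fn row_idx_filter_to_selection(lo: u64, hi: u64, row_count: u64) -> RowSelection {
    let (start, end) = (lo.min(row_count), hi.min(row_count));
    let ranges = if start < end { vec![start..end] } else { vec![] };
    RowSelection { ranges }
}

fn main() {
    // A highly selective predicate over a 1M-row file selects one tiny
    // range, short-circuiting most of the layout reader work.
    let sel = row_idx_filter_to_selection(500_000, 500_010, 1_000_000);
    assert_eq!(sel.ranges, vec![500_000..500_010]);
}
```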

gatesn added 9 commits May 6, 2026 20:39
Per-column stats access by struct field name. Used by the upcoming
MultiFileFunction-backed DuckDB scan, where DuckDB's
BaseFileReader::GetStatistics is keyed by name rather than index.

Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
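
For illustration, a minimal sketch of what name-keyed statistics access could look like on the Rust side (types and method names are hypothetical, not this commit's API):

```rust
use std::collections::HashMap;

/// Illustrative per-column statistics.
#[derive(Clone, Debug)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: u64,
}

/// Stats keyed by struct field name, matching how DuckDB's
/// BaseFileReader::GetStatistics asks for them: by name, not index.
struct FileStats {
    by_name: HashMap<String, ColumnStats>,
}

impl FileStats {
    /// Name-based lookup; None means "no stats", which DuckDB treats
    /// as unknown rather than an error.
    fn column(&self, field_name: &str) -> Option<&ColumnStats> {
        self.by_name.get(field_name)
    }
}

fn main() {
    let mut by_name = HashMap::new();
    by_name.insert(
        "id".to_string(),
        ColumnStats { min: Some(0), max: Some(99), null_count: 0 },
    );
    let stats = FileStats { by_name };
    assert!(stats.column("id").is_some());
    assert!(stats.column("does_not_exist").is_none());
}
```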
…nction<OP>

Wraps DuckDB's templated MultiFileFunction machinery so an extension can
plug in a per-format reader and inherit cross-file orchestration (file
globbing, virtual columns, hive partitioning, COPY support) for free.

Layered:
  - cpp/include/duckdb_vx/multi_file_function.h — C-ABI vtable
  - cpp/multi_file_function.cpp — VortexMultiFileReaderInterface and
    VortexFileReader subclass MultiFileReaderInterface and BaseFileReader,
    forwarding virtual calls to the FFI vtable
  - src/duckdb/multi_file_function.rs — Rust MultiFileFunction +
    BaseFileReader traits with associated types (mirroring the existing
    TableFunction shape) and a register_multi_file_function method on
    DatabaseRef

This commit does not yet register a function — the next commit adds
VortexMultiFileFunction and wires it up.

Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
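
As a rough illustration of the C-ABI vtable shape this commit describes, a Rust sketch follows; the field names and signatures are hypothetical, not copied from cpp/include/duckdb_vx/multi_file_function.h:

```rust
use std::ffi::{c_char, c_void};

/// Hypothetical C-ABI vtable the C++ adapter calls through. Each entry is
/// a plain function pointer taking opaque state, so the C++ subclasses of
/// MultiFileReaderInterface/BaseFileReader can forward their virtual calls
/// across the FFI boundary without knowing any Rust types.
#[repr(C)]
pub struct MultiFileFunctionVTable {
    /// Bind-time setup; returns an opaque bind_data pointer.
    pub bind: unsafe extern "C" fn(ctx: *mut c_void) -> *mut c_void,
    /// Create a per-file reader for one resolved path.
    pub create_reader:
        unsafe extern "C" fn(bind_data: *mut c_void, path: *const c_char) -> *mut c_void,
    /// Produce the next chunk; returns false when the file is exhausted.
    pub scan_chunk:
        unsafe extern "C" fn(reader: *mut c_void, out_chunk: *mut c_void) -> bool,
    /// Destructors, so ownership can round-trip through C++.
    pub destroy_reader: unsafe extern "C" fn(reader: *mut c_void),
    pub destroy_bind_data: unsafe extern "C" fn(bind_data: *mut c_void),
}
```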
Concrete MultiFileFunction implementation that drives a per-file scan via
VortexFile directly (rather than going through MultiLayoutDataSource),
making file-level statistics, dtype, and pruning available without
LayoutReader downcasts.

Registration:
  - read_vortex_v2 is always registered for direct comparison.
  - VX_DUCKDB_MULTI_FILE_FUNCTION=1 swaps read_vortex / vortex_scan over to
    the v2 path so existing benchmarks and SQL can run unchanged.

Smoke tests cover single-file, strings, and multi-file glob.

Known v2 gaps vs the existing TableFunction-backed scan (documented on
use_multi_file_function): no projection or filter pushdown; no Vortex
filesystem integration; no list-of-paths overload; no union_by_name. The
v2 path is intended for benchmarking the orchestration layer first;
parity work is follow-up.

Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
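
A sketch of the registration swap described above; `Database`, `register`, and the string "implementations" are stand-in stubs, not the extension's real API:

```rust
use std::collections::HashMap;

/// Stub registry so the sketch runs; the real extension registers DuckDB
/// table functions, not strings.
#[derive(Default)]
struct Database {
    functions: HashMap<&'static str, &'static str>,
}

impl Database {
    fn register(&mut self, name: &'static str, imp: &'static str) {
        self.functions.insert(name, imp);
    }
}

fn register_scan_functions(db: &mut Database) {
    // read_vortex_v2 is always registered for direct comparison.
    db.register("read_vortex_v2", "multi-file (v2) scan");

    // VX_DUCKDB_MULTI_FILE_FUNCTION=1 swaps the existing names over to the
    // v2 path, so existing benchmarks and SQL run unchanged.
    let use_v2 = std::env::var("VX_DUCKDB_MULTI_FILE_FUNCTION")
        .map(|v| v == "1")
        .unwrap_or(false);
    let existing = if use_v2 { "multi-file (v2) scan" } else { "table-function (v1) scan" };
    db.register("read_vortex", existing);
    db.register("vortex_scan", existing);
}

fn main() {
    let mut db = Database::default();
    register_scan_functions(&mut db);
    println!("{:?}", db.functions);
}
```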
Brings the MultiFileFunction-backed scan up to feature parity with the
existing TableFunction path on the test suite (189/189 pass under both
when VX_DUCKDB_MULTI_FILE_FUNCTION=1).

FFI surface additions on the multi-file vtable:
  - prepare_reader(reader, projection, filters): called once after
    create_reader, hands the per-file reader the columns DuckDB wants
    (in chunk order) plus the pushed-down TableFilterSet.
  - cardinality(bind_data, file_count): row-count estimate for the
    optimizer; falls back to DuckDB's default when not provided.
  - to_string(bind_data, map): bind-time EXPLAIN key/value output.

C++ adapter overrides BaseFileReader::PrepareReader to translate
column_ids -> projected column names and forward filters; overrides
MultiFileReaderInterface::GetCardinality and TableFunction::to_string
to delegate to the new FFI callbacks.

On the Rust side, the MultiFileFunction trait picks up `cardinality` and
`to_string` defaults; BaseFileReader picks up `prepare_reader` with a
default no-op so existing impls don't break.
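
A minimal sketch of how the defaulted trait methods keep existing impls compiling (signatures illustrative, not the PR's exact traits):

```rust
/// Illustrative stand-ins for the real FFI-facing types.
struct TableFilterSet;
struct BindData;

trait BaseFileReader {
    /// New hook with a default no-op so existing impls don't break: called
    /// once after create_reader with the projected column names (in chunk
    /// order) and the pushed-down filters.
    fn prepare_reader(&mut self, _projection: &[String], _filters: &TableFilterSet) {}
}

trait MultiFileFunction {
    /// Row-count estimate for the optimizer; None falls back to DuckDB's
    /// default cardinality.
    fn cardinality(&self, _bind_data: &BindData, _file_count: u64) -> Option<u64> {
        None
    }

    /// Bind-time EXPLAIN key/value output; the default emits nothing.
    fn to_string(&self, _bind_data: &BindData) -> Vec<(String, String)> {
        Vec::new()
    }
}
```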

VortexMultiFileFunction wires it together:
  - projection: builds a `select(names, root())` Vortex projection so
    chunks contain exactly the columns DuckDB expects (also handles
    SELECT count(*) which is the explicit zero-projection case).
  - filter: converts each TableFilter via try_from_table_filter and
    AND-collects into the scan filter.
  - file-level pruning: VortexFile::can_prune against the combined
    filter; pruned files short-circuit TryInitializeScan with false.
  - progress: rows_scanned / file.row_count(), scaled to [0, 100] (see the
    sketch after this list).
  - cardinality: rough APPROX_ROWS_PER_FILE * file_count estimate
    (bind_data can't be mutated from bind_reader through the current
    FFI; better numbers wait on that hop being added).
  - to_string: emits a "Function" row.
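
The progress and cardinality pieces, sketched; the APPROX_ROWS_PER_FILE value below is a hypothetical placeholder, not the constant used in the PR:

```rust
/// Scan progress as rows_scanned / row_count, scaled to DuckDB's
/// [0, 100] percentage range.
fn scan_progress(rows_scanned: u64, row_count: u64) -> f64 {
    if row_count == 0 {
        100.0
    } else {
        100.0 * rows_scanned as f64 / row_count as f64
    }
}

/// Rough cardinality estimate: bind_data can't yet be mutated from
/// bind_reader through the FFI, so a fixed per-file guess is all we have.
const APPROX_ROWS_PER_FILE: u64 = 1 << 20; // hypothetical placeholder

fn estimated_cardinality(file_count: u64) -> u64 {
    APPROX_ROWS_PER_FILE * file_count
}

fn main() {
    assert_eq!(scan_progress(50, 100), 50.0);
    assert_eq!(estimated_cardinality(3), 3 * APPROX_ROWS_PER_FILE);
}
```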

Multi-file orchestration:
  - register through MultiFileReader::CreateFunctionSet so the function
    set includes both the single-VARCHAR and LIST(VARCHAR) overloads.
    `read_vortex_v2(['a.vortex','b.vortex'])` works.
  - file IO routes through resolve_filesystem(base_url, ctx), so the
    `vortex_filesystem` extension option ('vortex' vs 'duckdb') chooses
    the same backends as the v1 path. HTTP/S3 work via DuckDB's httpfs
    when 'duckdb' is selected.
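
A sketch of the filesystem choice in the last bullet; the enum and function shape are illustrative (the real resolve_filesystem takes a base URL and client context):

```rust
/// The `vortex_filesystem` extension option picks which backend serves
/// file IO for the scan.
#[derive(Debug)]
enum FileSystemChoice {
    Vortex,
    DuckDb,
}

fn resolve_filesystem(option: &str) -> FileSystemChoice {
    match option {
        "vortex" => FileSystemChoice::Vortex,
        // 'duckdb' delegates to DuckDB's own FileSystem, so httpfs serves
        // http:// and s3:// URLs when that extension is loaded.
        _ => FileSystemChoice::DuckDb,
    }
}

fn main() {
    println!("{:?}", resolve_filesystem("duckdb"));
}
```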

Late materialization remains intentionally off until the per-file
reader supports AddVirtualColumn for file_index / file_row_number;
batch parallelism within a file is also a follow-up (TryInitializeScan
is still one-shot per reader).

Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>