DuckDB MultiFile Reader #7848

Draft
gatesn wants to merge 9 commits into develop from ngates/duckdb-multi

Conversation


gatesn commented May 8, 2026

Putting this up to start the discussion.

This PR exposes DuckDB's multi-file reader machinery to Rust, along with a Vortex implementation.

It is largely slower than our current implementation, for a few reasons:

  • There is more constant overhead on the DuckDB side, e.g. file mutexes.
  • DuckDB has stricter order preservation. We run one file per worker thread, essentially interleaving batches from multiple files, whereas DuckDB returns batches in order of {file_idx, batch_idx} (see the sketch below).
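
To make the ordering difference concrete, here is a minimal Rust sketch (names and values hypothetical, not from this PR) of the key DuckDB effectively preserves across files:

```rust
/// Hypothetical ordering key: batches compare first by source file, then
/// by position within that file, so output is {file_idx, batch_idx}-ordered.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct BatchKey {
    file_idx: u64,
    batch_idx: u64,
}

fn main() {
    // Two worker threads produce batches from two files concurrently,
    // arriving interleaved...
    let mut produced = vec![
        BatchKey { file_idx: 1, batch_idx: 0 },
        BatchKey { file_idx: 0, batch_idx: 1 },
        BatchKey { file_idx: 0, batch_idx: 0 },
        BatchKey { file_idx: 1, batch_idx: 1 },
    ];
    // ...but strict order preservation emits them sorted by the key,
    // which serializes output across files instead of interleaving them.
    produced.sort();
    assert_eq!(produced[0], BatchKey { file_idx: 0, batch_idx: 0 });
}
```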

A couple of queries are faster; this is typically due to per-file statistics.

The reason I was experimenting with this is that DuckLake is internally implemented using the MultiFileReader interface, so this would be the easiest way to add Vortex DuckLake support.

We may want to pull out parts of this PR:

  • Exposing the DuckDB object cache for footer caching; this reports exact stats when footers exist in the cache.
  • A fix that makes the FileSystem hold onto the ClientCtx to avoid a SIGSEGV in concurrent tests.
  • Code to rewrite row_idx() filters into RowSelection masks prior to constructing the layout reader tree. This can short-circuit a lot of overhead for highly selective queries (sketched below).
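
A minimal sketch of the row_idx() rewrite idea, assuming (purely for illustration; the PR's actual code differs) a predicate of the form `row_idx >= lo AND row_idx < hi` and a range-based row-selection mask:

```rust
use std::ops::Range;

/// Hypothetical stand-in for a RowSelection mask: row ranges the layout
/// reader tree can consume directly, skipping per-chunk filter evaluation.
#[derive(Debug, PartialEq)]
struct RowSelection {
    ranges: Vec<Range<u64>>,
}

/// Rewrite `row_idx() >= lo AND row_idx() < hi` into a RowSelection
/// clipped to the file's row count, before any layout readers are built.
fn row_idx_filter_to_selection(lo: u64, hi: u64, row_count: u64) -> RowSelection {
    let (start, end) = (lo.min(row_count), hi.min(row_count));
    let ranges = if start < end { vec![start..end] } else { vec![] };
    RowSelection { ranges }
}

fn main() {
    // A highly selective predicate over a 1M-row file selects one tiny
    // range, short-circuiting most of the layout reader work.
    let sel = row_idx_filter_to_selection(500_000, 500_010, 1_000_000);
    assert_eq!(sel.ranges, vec![500_000..500_010]);
}
```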

gatesn added 9 commits May 6, 2026 20:39
Per-column stats access by struct field name. Used by the upcoming
MultiFileFunction-backed DuckDB scan, where DuckDB's
BaseFileReader::GetStatistics is keyed by name rather than index.

Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
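
For illustration, a minimal sketch of what name-keyed statistics access could look like on the Rust side (types and method names are hypothetical, not this commit's API):

```rust
use std::collections::HashMap;

/// Illustrative per-column statistics.
#[derive(Clone, Debug)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: u64,
}

/// Stats keyed by struct field name, matching how DuckDB's
/// BaseFileReader::GetStatistics asks for them: by name, not index.
struct FileStats {
    by_name: HashMap<String, ColumnStats>,
}

impl FileStats {
    /// Name-based lookup; None means "no stats", which DuckDB treats
    /// as unknown rather than an error.
    fn column(&self, field_name: &str) -> Option<&ColumnStats> {
        self.by_name.get(field_name)
    }
}

fn main() {
    let mut by_name = HashMap::new();
    by_name.insert(
        "id".to_string(),
        ColumnStats { min: Some(0), max: Some(99), null_count: 0 },
    );
    let stats = FileStats { by_name };
    assert!(stats.column("id").is_some());
    assert!(stats.column("does_not_exist").is_none());
}
```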
…nction<OP>

Wraps DuckDB's templated MultiFileFunction machinery so an extension can
plug in a per-format reader and inherit cross-file orchestration (file
globbing, virtual columns, hive partitioning, COPY support) for free.

Layered:
  - cpp/include/duckdb_vx/multi_file_function.h — C-ABI vtable
  - cpp/multi_file_function.cpp — VortexMultiFileReaderInterface and
    VortexFileReader subclass MultiFileReaderInterface and BaseFileReader,
    forwarding virtual calls to the FFI vtable
  - src/duckdb/multi_file_function.rs — Rust MultiFileFunction +
    BaseFileReader traits with associated types (mirroring the existing
    TableFunction shape) and a register_multi_file_function method on
    DatabaseRef

This commit does not yet register a function — the next commit adds
VortexMultiFileFunction and wires it up.

Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
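
As a rough illustration of the C-ABI vtable shape this commit describes, a Rust sketch follows; the field names and signatures are hypothetical, not copied from cpp/include/duckdb_vx/multi_file_function.h:

```rust
use std::ffi::{c_char, c_void};

/// Hypothetical C-ABI vtable the C++ adapter calls through. Each entry is
/// a plain function pointer taking opaque state, so the C++ subclasses of
/// MultiFileReaderInterface/BaseFileReader can forward their virtual calls
/// across the FFI boundary without knowing any Rust types.
#[repr(C)]
pub struct MultiFileFunctionVTable {
    /// Bind-time setup; returns an opaque bind_data pointer.
    pub bind: unsafe extern "C" fn(ctx: *mut c_void) -> *mut c_void,
    /// Create a per-file reader for one resolved path.
    pub create_reader:
        unsafe extern "C" fn(bind_data: *mut c_void, path: *const c_char) -> *mut c_void,
    /// Produce the next chunk; returns false when the file is exhausted.
    pub scan_chunk:
        unsafe extern "C" fn(reader: *mut c_void, out_chunk: *mut c_void) -> bool,
    /// Destructors, so ownership can round-trip through C++.
    pub destroy_reader: unsafe extern "C" fn(reader: *mut c_void),
    pub destroy_bind_data: unsafe extern "C" fn(bind_data: *mut c_void),
}
```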
Concrete MultiFileFunction implementation that drives a per-file scan via
VortexFile directly (rather than going through MultiLayoutDataSource),
making file-level statistics, dtype, and pruning available without
LayoutReader downcasts.

Registration:
  - read_vortex_v2 is always registered for direct comparison.
  - VX_DUCKDB_MULTI_FILE_FUNCTION=1 swaps read_vortex / vortex_scan over to
    the v2 path so existing benchmarks and SQL can run unchanged.

Smoke tests cover single-file, strings, and multi-file glob.

Known v2 gaps vs the existing TableFunction-backed scan (documented on
use_multi_file_function): no projection or filter pushdown; no Vortex
filesystem integration; no list-of-paths overload; no union_by_name. The
v2 path is intended for benchmarking the orchestration layer first;
parity work is follow-up.

Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
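
A sketch of the registration swap described above; `Database`, `register`, and the string "implementations" are stand-in stubs, not the extension's real API:

```rust
use std::collections::HashMap;

/// Stub registry so the sketch runs; the real extension registers DuckDB
/// table functions, not strings.
#[derive(Default)]
struct Database {
    functions: HashMap<&'static str, &'static str>,
}

impl Database {
    fn register(&mut self, name: &'static str, imp: &'static str) {
        self.functions.insert(name, imp);
    }
}

fn register_scan_functions(db: &mut Database) {
    // read_vortex_v2 is always registered for direct comparison.
    db.register("read_vortex_v2", "multi-file (v2) scan");

    // VX_DUCKDB_MULTI_FILE_FUNCTION=1 swaps the existing names over to the
    // v2 path, so existing benchmarks and SQL run unchanged.
    let use_v2 = std::env::var("VX_DUCKDB_MULTI_FILE_FUNCTION")
        .map(|v| v == "1")
        .unwrap_or(false);
    let existing = if use_v2 { "multi-file (v2) scan" } else { "table-function (v1) scan" };
    db.register("read_vortex", existing);
    db.register("vortex_scan", existing);
}

fn main() {
    let mut db = Database::default();
    register_scan_functions(&mut db);
    println!("{:?}", db.functions);
}
```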
Brings the MultiFileFunction-backed scan up to feature parity with the
existing TableFunction path on the test suite (189/189 pass under both
when VX_DUCKDB_MULTI_FILE_FUNCTION=1).

FFI surface additions on the multi-file vtable:
  - prepare_reader(reader, projection, filters): called once after
    create_reader, hands the per-file reader the columns DuckDB wants
    (in chunk order) plus the pushed-down TableFilterSet.
  - cardinality(bind_data, file_count): row-count estimate for the
    optimizer; falls back to DuckDB's default when not provided.
  - to_string(bind_data, map): bind-time EXPLAIN key/value output.

C++ adapter overrides BaseFileReader::PrepareReader to translate
column_ids -> projected column names and forward filters; overrides
MultiFileReaderInterface::GetCardinality and TableFunction::to_string
to delegate to the new FFI callbacks.

On the Rust side, the MultiFileFunction trait picks up `cardinality` and
`to_string` defaults; BaseFileReader picks up `prepare_reader` with a
default no-op so existing impls don't break.
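
A minimal sketch of how the defaulted trait methods keep existing impls compiling (signatures illustrative, not the PR's exact traits):

```rust
/// Illustrative stand-ins for the real FFI-facing types.
struct TableFilterSet;
struct BindData;

trait BaseFileReader {
    /// New hook with a default no-op so existing impls don't break: called
    /// once after create_reader with the projected column names (in chunk
    /// order) and the pushed-down filters.
    fn prepare_reader(&mut self, _projection: &[String], _filters: &TableFilterSet) {}
}

trait MultiFileFunction {
    /// Row-count estimate for the optimizer; None falls back to DuckDB's
    /// default cardinality.
    fn cardinality(&self, _bind_data: &BindData, _file_count: u64) -> Option<u64> {
        None
    }

    /// Bind-time EXPLAIN key/value output; the default emits nothing.
    fn to_string(&self, _bind_data: &BindData) -> Vec<(String, String)> {
        Vec::new()
    }
}
```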

VortexMultiFileFunction wires it together:
  - projection: builds a `select(names, root())` Vortex projection so
    chunks contain exactly the columns DuckDB expects (also handles
    SELECT count(*) which is the explicit zero-projection case).
  - filter: converts each TableFilter via try_from_table_filter and
    AND-collects into the scan filter.
  - file-level pruning: VortexFile::can_prune against the combined
    filter; pruned files short-circuit TryInitializeScan with false.
  - progress: rows_scanned / file.row_count(), scaled to [0, 100] (see the
    sketch after this list).
  - cardinality: rough APPROX_ROWS_PER_FILE * file_count estimate
    (bind_data can't be mutated from bind_reader through the current
    FFI; better numbers wait on that hop being added).
  - to_string: emits a "Function" row.
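
The progress and cardinality pieces, sketched; the APPROX_ROWS_PER_FILE value below is a hypothetical placeholder, not the constant used in the PR:

```rust
/// Scan progress as rows_scanned / row_count, scaled to DuckDB's
/// [0, 100] percentage range.
fn scan_progress(rows_scanned: u64, row_count: u64) -> f64 {
    if row_count == 0 {
        100.0
    } else {
        100.0 * rows_scanned as f64 / row_count as f64
    }
}

/// Rough cardinality estimate: bind_data can't yet be mutated from
/// bind_reader through the FFI, so a fixed per-file guess is all we have.
const APPROX_ROWS_PER_FILE: u64 = 1 << 20; // hypothetical placeholder

fn estimated_cardinality(file_count: u64) -> u64 {
    APPROX_ROWS_PER_FILE * file_count
}

fn main() {
    assert_eq!(scan_progress(50, 100), 50.0);
    assert_eq!(estimated_cardinality(3), 3 * APPROX_ROWS_PER_FILE);
}
```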

Multi-file orchestration:
  - register through MultiFileReader::CreateFunctionSet so the function
    set includes both the single-VARCHAR and LIST(VARCHAR) overloads.
    `read_vortex_v2(['a.vortex','b.vortex'])` works.
  - file IO routes through resolve_filesystem(base_url, ctx), so the
    `vortex_filesystem` extension option ('vortex' vs 'duckdb') chooses
    the same backends as the v1 path. HTTP/S3 work via DuckDB's httpfs
    when 'duckdb' is selected.
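
A sketch of the filesystem choice in the last bullet; the enum and function shape are illustrative (the real resolve_filesystem takes a base URL and client context):

```rust
/// The `vortex_filesystem` extension option picks which backend serves
/// file IO for the scan.
#[derive(Debug)]
enum FileSystemChoice {
    Vortex,
    DuckDb,
}

fn resolve_filesystem(option: &str) -> FileSystemChoice {
    match option {
        "vortex" => FileSystemChoice::Vortex,
        // 'duckdb' delegates to DuckDB's own FileSystem, so httpfs serves
        // http:// and s3:// URLs when that extension is loaded.
        _ => FileSystemChoice::DuckDb,
    }
}

fn main() {
    println!("{:?}", resolve_filesystem("duckdb"));
}
```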

Late materialization remains intentionally off until the per-file
reader supports AddVirtualColumn for file_index / file_row_number;
batch parallelism within a file is also a follow-up (TryInitializeScan
is still one-shot per reader).

Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>