Per-column stats access by struct field name. Used by the upcoming MultiFileFunction-backed DuckDB scan, where DuckDB's BaseFileReader::GetStatistics is keyed by name rather than index.
Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
…nction<OP>
Wraps DuckDB's templated MultiFileFunction machinery so an extension can
plug in a per-format reader and inherit cross-file orchestration (file
globbing, virtual columns, hive partitioning, COPY support) for free.
Layered:
- cpp/include/duckdb_vx/multi_file_function.h — C-ABI vtable
- cpp/multi_file_function.cpp — VortexMultiFileReaderInterface and
VortexFileReader subclass MultiFileReaderInterface and BaseFileReader,
forwarding virtual calls to the FFI vtable
- src/duckdb/multi_file_function.rs — Rust MultiFileFunction +
BaseFileReader traits with associated types (mirroring the existing
TableFunction shape) and a register_multi_file_function method on
DatabaseRef
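As a rough illustration of the trait layering described above, here is a minimal self-contained sketch. Every name, signature, and the toy types are hypothetical stand-ins, not the actual vortex-duckdb API; the point is only the shape: a format-level trait tied to its per-file reader through an associated type.

```rust
// Hypothetical sketch of the Rust trait layering; `BaseFileReader`,
// `MultiFileFunction`, and the toy types below are illustrative
// stand-ins, not the real vortex API.

/// Per-file reader; one instance is created for each file in the scan.
pub trait BaseFileReader {
    /// Return the next chunk of rows, or None when the file is exhausted.
    fn next_chunk(&mut self) -> Option<Vec<String>>;
}

/// Format-level entry point; the associated type ties each function to
/// its concrete reader, mirroring the existing TableFunction shape.
pub trait MultiFileFunction {
    type Reader: BaseFileReader;

    /// Called once per file by the orchestration layer.
    fn create_reader(&self, path: &str) -> Self::Reader;
}

/// Toy implementation used only to show the wiring.
pub struct ToyReader {
    rows_left: usize,
    path: String,
}

impl BaseFileReader for ToyReader {
    fn next_chunk(&mut self) -> Option<Vec<String>> {
        if self.rows_left == 0 {
            return None;
        }
        self.rows_left -= 1;
        Some(vec![format!("row from {}", self.path)])
    }
}

pub struct ToyFunction;

impl MultiFileFunction for ToyFunction {
    type Reader = ToyReader;

    fn create_reader(&self, path: &str) -> ToyReader {
        ToyReader { rows_left: 2, path: path.to_string() }
    }
}
```

The associated type keeps each function statically bound to its reader, so the FFI adapter layer can be generated per implementation rather than dispatched dynamically on the Rust side.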
This commit does not yet register a function — the next commit adds
VortexMultiFileFunction and wires it up.
Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Concrete MultiFileFunction implementation that drives a per-file scan via
VortexFile directly (rather than going through MultiLayoutDataSource),
making file-level statistics, dtype, and pruning available without
LayoutReader downcasts.
Registration:
- read_vortex_v2 is always registered for direct comparison.
- VX_DUCKDB_MULTI_FILE_FUNCTION=1 swaps read_vortex / vortex_scan over to
the v2 path so existing benchmarks and SQL can run unchanged.
Smoke tests cover single-file, strings, and multi-file glob.
Known v2 gaps vs the existing TableFunction-backed scan (documented on
use_multi_file_function): no projection or filter pushdown; no Vortex
filesystem integration; no list-of-paths overload; no union_by_name. The
v2 path is intended for benchmarking the orchestration layer first;
parity work is follow-up.
Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Brings the MultiFileFunction-backed scan up to feature parity with the
existing TableFunction path on the test suite (189/189 tests pass both
with and without VX_DUCKDB_MULTI_FILE_FUNCTION=1).
FFI surface additions on the multi-file vtable:
- prepare_reader(reader, projection, filters): called once after
create_reader, hands the per-file reader the columns DuckDB wants
(in chunk order) plus the pushed-down TableFilterSet.
- cardinality(bind_data, file_count): row-count estimate for the
optimizer; falls back to DuckDB's default when not provided.
- to_string(bind_data, map): bind-time EXPLAIN key/value output.
C++ adapter overrides BaseFileReader::PrepareReader to translate
column_ids -> projected column names and forward filters; overrides
MultiFileReaderInterface::GetCardinality and TableFunction::to_string
to delegate to the new FFI callbacks.
In the Rust trait MultiFileFunction picks up `cardinality` and
`to_string` defaults; BaseFileReader picks up `prepare_reader` with a
default no-op so existing impls don't break.
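The source-compatibility trick here can be shown with a small self-contained sketch (trait and method names are illustrative, not the real API): a method added with a default body keeps pre-existing impls compiling unchanged.

```rust
// Illustrative sketch only: how a defaulted trait method keeps older
// implementations source-compatible when the trait grows.
pub trait FileReader {
    fn row_count(&self) -> usize;

    /// Newly added hook. The default no-op means impls written before
    /// this method existed still compile and behave unchanged.
    fn prepare_reader(&mut self, _projection: &[String]) {}
}

/// An "old" impl that predates prepare_reader; it needs no changes.
pub struct LegacyReader;

impl FileReader for LegacyReader {
    fn row_count(&self) -> usize {
        42
    }
}
```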
VortexMultiFileFunction wires it together:
- projection: builds a `select(names, root())` Vortex projection so
chunks contain exactly the columns DuckDB expects (also handles
SELECT count(*) which is the explicit zero-projection case).
- filter: converts each TableFilter via try_from_table_filter and
AND-collects into the scan filter.
- file-level pruning: VortexFile::can_prune against the combined
filter; pruned files short-circuit TryInitializeScan with false.
- progress: rows_scanned / file.row_count() in [0, 100].
- cardinality: rough APPROX_ROWS_PER_FILE * file_count estimate
(bind_data can't be mutated from bind_reader through the current
FFI; better numbers wait on that hop being added).
- to_string: emits a "Function" row.
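The filter, progress, and cardinality bullets above can be sketched as small self-contained helpers. Everything here is a hypothetical stand-in: the Expr type replaces the real Vortex filter expression, the constant's actual value is not known from this PR, and edge-case choices (e.g. what an empty file reports for progress) are this sketch's, not necessarily the real code's.

```rust
// Hypothetical stand-in for a Vortex filter expression.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Gt(String, i64),
    And(Box<Expr>, Box<Expr>),
}

/// AND-collect per-column filters into one combined predicate;
/// None when DuckDB pushed no filters down.
pub fn and_collect(filters: Vec<Expr>) -> Option<Expr> {
    filters
        .into_iter()
        .reduce(|acc, f| Expr::And(Box::new(acc), Box::new(f)))
}

/// Scan progress in [0, 100]. An empty file reports 100 here — a
/// choice made for this sketch.
pub fn scan_progress(rows_scanned: u64, row_count: u64) -> f64 {
    if row_count == 0 {
        return 100.0;
    }
    (rows_scanned as f64 / row_count as f64 * 100.0).clamp(0.0, 100.0)
}

/// Placeholder per-file row guess; the real constant's value is unknown.
pub const APPROX_ROWS_PER_FILE: u64 = 1_000_000;

/// Rough optimizer estimate: fixed per-file guess times file count,
/// since exact counts can't yet flow back through the FFI at bind time.
pub fn estimate_cardinality(file_count: u64) -> u64 {
    APPROX_ROWS_PER_FILE.saturating_mul(file_count)
}
```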
Multi-file orchestration:
- register through MultiFileReader::CreateFunctionSet so the function
set includes both the single-VARCHAR and LIST(VARCHAR) overloads.
`read_vortex_v2(['a.vortex','b.vortex'])` works.
- file IO routes through resolve_filesystem(base_url, ctx), so the
`vortex_filesystem` extension option ('vortex' vs 'duckdb') chooses
the same backends as the v1 path. HTTP/S3 work via DuckDB's httpfs
when 'duckdb' is selected.
Late materialization remains intentionally off until the per-file
reader supports AddVirtualColumn for file_index / file_row_number;
batch parallelism within a file is also a follow-up (TryInitializeScan
is still one-shot per reader).
Signed-off-by: Nicholas Gates <nick@spiraldb.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Putting this up to start the discussion.
This PR exposes DuckDB's multi-file reader machinery to Rust, along with a Vortex implementation.
It is largely slower than our current implementation, for a few reasons:
A couple of queries are faster; this is typically due to per-file statistics.
The reason I was experimenting with this is that DuckLake is internally implemented using the MultiFileReader interface, so this would be the easiest way to add Vortex DuckLake support.
We may want to pull out parts of this PR: