Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
54 commits
Select commit Hold shift + click to select a range
532b112
Integrate ScanNode V2 scan path
gatesn Jun 17, 2026
5e30a36
Add ScanNode split planning
gatesn Jun 17, 2026
9f50387
WIP checkpoint: V2 scan driver + scheduler (pre-agent-integration)
gatesn Jun 18, 2026
7379a1b
Fix V2 struct single-field projection dropping parent validity
gatesn Jun 18, 2026
12fb901
perf(scan-v2): order conjuncts by cost and evaluate residual filters …
gatesn Jun 18, 2026
abfe671
perf(scan-v2): distribute scan morsels across DataFusion partitions
gatesn Jun 18, 2026
426fd3d
bench(datafusion): dump per-operator annotated plan under VORTEX_BENC…
gatesn Jun 18, 2026
463e6b6
Add v2 layout vtable scan path
gatesn Jun 18, 2026
a000b1d
Optimize scan2 predicate evidence scheduling
gatesn Jun 19, 2026
45edf17
Optimize scan2 sparse dictionary reads
gatesn Jun 19, 2026
99dd46a
Widen sparse dict reads for filtered projections
gatesn Jun 19, 2026
84a4d7e
Avoid dense sparse dict compaction
gatesn Jun 19, 2026
28bc2b4
Share dict scan value caches
gatesn Jun 19, 2026
1ee55db
Share prepared scan state
gatesn Jun 20, 2026
3b0fcfa
Run PR benchmarks with scan2
gatesn Jun 20, 2026
2618394
Move scan plan runtime into vortex-scan
gatesn Jun 20, 2026
3ddb91b
Merge develop into layout27
gatesn Jun 20, 2026
ea9ef64
Fix struct binding
gatesn Jun 20, 2026
f7b42b7
Fix struct binding
gatesn Jun 20, 2026
962c099
Chunk DuckDB exports on chunk boundaries
gatesn Jun 20, 2026
45373cd
Optimize V2 scan stats and literals
gatesn Jun 21, 2026
2225c16
Fix struct binding
gatesn Jun 21, 2026
505f3b0
Merge remote-tracking branch 'origin/develop' into ngates/layout27
gatesn Jun 21, 2026
2615c19
Improve sparse OnPair scan projection
gatesn Jun 21, 2026
5c6e824
Reduce V2 duplicate scan requests
gatesn Jun 21, 2026
e557c0c
Simplify scan task read scheduling
gatesn Jun 22, 2026
33433bc
Port split planning to scan2
gatesn Jun 22, 2026
0b6e79e
Improve V2 scan partition scheduling
gatesn Jun 22, 2026
942ba84
Tune scan task scheduling
gatesn Jun 22, 2026
7910796
Merge remote-tracking branch 'origin/develop' into ngates/layout27
gatesn Jun 22, 2026
79ecea4
Tune scan2 dictionary projection
gatesn Jun 22, 2026
9e2a53e
Fix sparse byte-view Arrow export
gatesn Jun 23, 2026
40bd782
Merge origin/develop into ngates/layout27
gatesn Jun 23, 2026
f740e71
Fix assertion contexts after develop merge
gatesn Jun 23, 2026
000485c
Fix PolarSignals V2 scan regressions
gatesn Jun 23, 2026
80f0dda
Refactor scan tasks into continuation steps
gatesn Jun 24, 2026
41871ef
Merge remote-tracking branch 'origin/develop' into ngates/layout27
gatesn Jun 24, 2026
662e2b6
Fix CI failures after scan merge
gatesn Jun 24, 2026
0d1c52b
Remove unused scan2 session switch
gatesn Jun 24, 2026
ff63cb5
Prepare scan v2 APIs for merge
gatesn Jun 24, 2026
0cc1421
Clean up DataFusion scan config API
gatesn Jun 24, 2026
5376adf
Split layout v2 vtables into layouts_v2
gatesn Jun 24, 2026
e96e21d
Refactor ScanPlan read scheduling
gatesn Jun 24, 2026
aa6096e
Remove split encoding changes from layout27
gatesn Jun 24, 2026
2926420
Remove row-index sortedness validation from layout27
gatesn Jun 24, 2026
97042b6
Use session default scan scheduler
gatesn Jun 24, 2026
7d285c1
slqlogictest
gatesn Jun 24, 2026
2f37a1b
Remove DuckDB max cardinality from layout27
gatesn Jun 24, 2026
fb0d070
Remove varbinview compaction change from layout27
gatesn Jun 24, 2026
f4d8fa0
Remove Arrow export compaction change from layout27
gatesn Jun 24, 2026
62d88dc
Move scan plan API into vortex-scan
gatesn Jun 24, 2026
150a52c
Move scan plan runtime into vortex-scan
gatesn Jun 24, 2026
93f80af
Merge remote-tracking branch 'origin/develop' into ngates/layout27
gatesn Jun 25, 2026
04ded80
Default to ScanV2
gatesn Jun 25, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions .agents/skills/bench-performance/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,14 @@ Do not wait for a deep code read before showing benchmark comparisons or first s
- engine/format target(s), for example `datafusion:vortex` versus `datafusion:parquet`;
- runtime environment toggles, if the branch exposes any.

If the checkout is an agent worktree, keep benchmark data in the canonical checkout cache rather
than downloading or generating it inside the worktree. Prefer a `file://` data URL that points at
`/Users/ngates/git/vortex/vortex-bench/data/...` (or the user's main checkout equivalent), for
example `--opt remote-data-dir=file:///Users/ngates/git/vortex/vortex-bench/data/clickbench_partitioned/`
when the benchmark supports `remote-data-dir`. For local-only suites such as `statpopgen`, run
from the main checkout or arrange the suite's `vortex-bench/data/<suite>/...` path to reuse that
canonical cache before generating data.

3. Run a small comparable benchmark through `vx-bench`:

```bash
Expand Down
6 changes: 6 additions & 0 deletions .github/workflows/sql-benchmarks.yml
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,11 @@ on:
required: false
type: string
default: i7i.metal-24xl
vortex_scan_impl:
required: false
type: string
default: ""
description: "Optional VORTEX_SCAN_IMPL override for Vortex file scans, e.g. v1 for legacy scans"
benchmark_matrix:
required: false
type: string
Expand Down Expand Up @@ -511,6 +516,7 @@ jobs:
bench:
timeout-minutes: 120
env:
VORTEX_SCAN_IMPL: ${{ inputs.vortex_scan_impl }}
VORTEX_EXPERIMENTAL_PATCHED_ARRAY: "1"
FLAT_LAYOUT_INLINE_ARRAY_NODE: "1"
# Makes python output nicer
Expand Down
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -245,3 +245,6 @@ vortex-python/.benchmarks/
# For local benchmarks website server and things like the WAL
**.duckdb*
.bench-env

.agents/worktrees/
.claude/worktrees/
13 changes: 13 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,19 @@ cargo +nightly fmt --all
cargo clippy --all-targets --all-features
```

Before pushing Rust changes, compile the relevant test targets, not only library targets. At
minimum, run `cargo test -p <crate-name> --all-features --no-run` for every touched Rust crate that
has tests. For cross-crate scan, layout, file, Arrow export, or execution-context changes, include
the crates that can compile hidden or feature-gated tests, for example:

```bash
cargo test -p vortex-array --all-features --no-run
cargo check -p vortex-layout -p vortex-file -p vortex-duckdb -p vortex-datafusion --all-features
```

Do not push after merge conflict resolution until the post-merge test-target build succeeds for the
affected crates.

Notes:

- For `.github/` changes, follow `.github/AGENTS.md` and run
Expand Down
5 changes: 5 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

52 changes: 46 additions & 6 deletions benchmarks/datafusion-bench/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,10 @@ use object_store::aws::AmazonS3Builder;
use object_store::gcp::GoogleCloudStorageBuilder;
use object_store::local::LocalFileSystem;
use url::Url;
use vortex::scan::ScanScheduler;
use vortex::scan::ScanSchedulerConfig;
use vortex::scan::ScanSchedulerSessionExt;
use vortex::session::VortexSession;
use vortex_bench::Format;
use vortex_bench::SESSION;
use vortex_datafusion::VortexFormat;
Expand All @@ -45,7 +49,11 @@ pub fn get_session_context() -> SessionContext {
.build_arc()
.expect("could not build runtime environment");

let factory = VortexFormatFactory::new().with_options(vortex_table_options());
let factory = VortexFormatFactory::new()
.with_session(
vortex_session_from_env().expect("invalid Vortex benchmark scan scheduler env"),
)
.with_options(vortex_table_options());

let mut session_state_builder = SessionStateBuilder::new()
.with_config(SessionConfig::from_env().expect("shouldn't fail"))
Expand Down Expand Up @@ -106,19 +114,51 @@ pub fn make_object_store(
}
}

pub fn format_to_df_format(format: Format) -> Arc<dyn FileFormat> {
match format {
pub fn format_to_df_format(format: Format) -> anyhow::Result<Arc<dyn FileFormat>> {
Ok(match format {
Format::Csv => Arc::new(CsvFormat::default()) as _,
Format::Arrow => Arc::new(ArrowFormat),
Format::Parquet => Arc::new(ParquetFormat::new()),
Format::OnDiskVortex | Format::VortexCompact => Arc::new(VortexFormat::new_with_options(
SESSION.clone(),
vortex_session_from_env()?,
vortex_table_options(),
)),
Format::OnDiskDuckDB | Format::Lance => {
unimplemented!("Format {format} cannot be turned into a DataFusion `FileFormat`")
anyhow::bail!("Format {format} cannot be turned into a DataFusion `FileFormat`")
}
}
})
}

fn vortex_session_from_env() -> anyhow::Result<VortexSession> {
let session = SESSION.clone();
let Ok(mode) = std::env::var("VORTEX_SCAN_SCHEDULER") else {
return Ok(session);
};
let config = scan_scheduler_config_from_env()?;
Ok(match mode.as_str() {
"unbounded" => session.with_unbounded_scan_scheduler(),
"shared" | "global" => session.with_scan_scheduler(Arc::new(ScanScheduler::new(config))),
"per-query" | "per-scan" => session.with_new_scan_scheduler_per_scan(config),
other => anyhow::bail!(
"Invalid VORTEX_SCAN_SCHEDULER={other}; expected unbounded, shared, or per-query"
),
})
}

fn scan_scheduler_config_from_env() -> anyhow::Result<ScanSchedulerConfig> {
let read_byte_budget = std::env::var("VORTEX_SCAN_MAX_READ_BYTES")
.ok()
.map(|value| {
value.parse::<u64>().map_err(|e| {
anyhow::anyhow!("invalid scan scheduler read byte budget {value}: {e}")
})
})
.transpose()?;

Ok(match read_byte_budget {
Some(bytes) => ScanSchedulerConfig::default().with_read_byte_budget(Some(bytes)),
None => ScanSchedulerConfig::default(),
})
}

fn vortex_table_options() -> VortexTableOptions {
Expand Down
120 changes: 37 additions & 83 deletions benchmarks/datafusion-bench/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -27,16 +27,13 @@ use datafusion_physical_plan::collect;
use futures::StreamExt;
use parking_lot::Mutex;
use tokio::fs::File;
use vortex::io::filesystem::FileSystemRef;
use vortex::scan::DataSourceRef;
use vortex_bench::Benchmark;
use vortex_bench::BenchmarkArg;
use vortex_bench::CompactionStrategy;
use vortex_bench::Engine;
use vortex_bench::Format;
use vortex_bench::Opt;
use vortex_bench::Opts;
use vortex_bench::SESSION;
use vortex_bench::conversions::convert_parquet_directory_to_vortex;
use vortex_bench::create_benchmark;
use vortex_bench::create_output_writer;
Expand Down Expand Up @@ -190,7 +187,7 @@ async fn main() -> anyhow::Result<()> {
async move {
let session = datafusion_bench::get_session_context();
datafusion_bench::make_object_store(&session, benchmark.data_url())?;
register_benchmark_tables(&session, benchmark, format).await?;
register_benchmark_tables(&session, benchmark, format, show_metrics).await?;
Ok((session, format))
}
},
Expand Down Expand Up @@ -246,99 +243,42 @@ async fn main() -> anyhow::Result<()> {
Ok(())
}

fn use_scan_api() -> bool {
std::env::var("VORTEX_USE_SCAN_API").is_ok_and(|v| v == "1")
}

async fn register_benchmark_tables<B: Benchmark + ?Sized>(
session: &SessionContext,
benchmark: &B,
format: Format,
_show_metrics: bool,
) -> anyhow::Result<()> {
match format {
Format::Arrow => register_arrow_tables(session, benchmark).await,
_ if use_scan_api() && matches!(format, Format::OnDiskVortex | Format::VortexCompact) => {
register_v2_tables(session, benchmark, format).await
}
_ => {
let benchmark_base = benchmark.data_url().join(&format!("{}/", format.name()))?;
let file_format = format_to_df_format(format);

for table in benchmark.table_specs().iter() {
let pattern = benchmark.pattern(table.name, format);
let table_url = ListingTableUrl::try_new(benchmark_base.clone(), pattern)?;

let listing_options = ListingOptions::new(Arc::clone(&file_format))
.with_session_config_options(session.state().config());
let mut config =
ListingTableConfig::new(table_url).with_listing_options(listing_options);

config = match table.schema.as_ref() {
Some(schema) => config.with_schema(Arc::new(schema.clone())),
None => config.infer_schema(&session.state()).await?,
};

let listing_table = Arc::new(
ListingTable::try_new(config)?.with_cache(
session
.runtime_env()
.cache_manager
.get_file_statistic_cache(),
),
);

session.register_table(table.name, listing_table)?;
}

Ok(())
}
if matches!(format, Format::Arrow) {
return register_arrow_tables(session, benchmark).await;
}
}

/// Register tables using the V2 `VortexTable` + `MultiFileDataSource` path.
async fn register_v2_tables<B: Benchmark + ?Sized>(
session: &SessionContext,
benchmark: &B,
format: Format,
) -> anyhow::Result<()> {
use vortex::file::multi::MultiFileDataSource;
use vortex::io::object_store::ObjectStoreFileSystem;
use vortex::io::session::RuntimeSessionExt;
use vortex::scan::DataSource as _;
use vortex_datafusion::v2::VortexTable;

let benchmark_base = benchmark.data_url().join(&format!("{}/", format.name()))?;
let file_format = format_to_df_format(format)?;

for table in benchmark.table_specs().iter() {
let pattern = benchmark.pattern(table.name, format);
let table_url = ListingTableUrl::try_new(benchmark_base.clone(), pattern.clone())?;
let store = session
.state()
.runtime_env()
.object_store(table_url.object_store())?;

let fs: FileSystemRef = Arc::new(ObjectStoreFileSystem::new(
Arc::clone(&store),
SESSION.handle(),
));
let base_prefix = benchmark_base.path().trim_start_matches('/').to_string();
let fs = fs.with_prefix(base_prefix);

let glob_pattern = match &pattern {
Some(p) => p.as_str().to_string(),
None => format!("*.{}", format.ext()),
};
let table_url = ListingTableUrl::try_new(benchmark_base.clone(), pattern)?;

let multi_ds = MultiFileDataSource::new(SESSION.clone())
.with_glob(glob_pattern, Some(fs))
.build()
.await?;
let listing_options = ListingOptions::new(Arc::clone(&file_format))
.with_session_config_options(session.state().config());
let mut config = ListingTableConfig::new(table_url).with_listing_options(listing_options);

let arrow_schema = Arc::new(multi_ds.dtype().to_arrow_schema()?);
let data_source: DataSourceRef = Arc::new(multi_ds);
config = match table.schema.as_ref() {
Some(schema) => config.with_schema(Arc::new(schema.clone())),
None => config.infer_schema(&session.state()).await?,
};

let listing_table = Arc::new(
ListingTable::try_new(config)?.with_cache(
session
.runtime_env()
.cache_manager
.get_file_statistic_cache(),
),
);

let table_provider = Arc::new(VortexTable::new(data_source, SESSION.clone(), arrow_schema));
session.register_table(table.name, table_provider)?;
session.register_table(table.name, listing_table)?;
}

Ok(())
Expand Down Expand Up @@ -439,7 +379,21 @@ pub async fn execute_query(

/// Print Vortex metrics from execution plans.
fn print_metrics(plans: &[(usize, Format, Arc<dyn ExecutionPlan>)]) {
// VORTEX_BENCH_FULL_PLAN=1 dumps the full per-operator annotated plan (DataFusion
// EXPLAIN ANALYZE-style: elapsed_compute / output_rows per operator), to localize where
// wall time goes (scan vs HashJoin build/probe vs aggregate).
let full_plan = std::env::var_os("VORTEX_BENCH_FULL_PLAN").is_some();
for (query_idx, format, plan) in plans {
if full_plan {
eprintln!("=== annotated plan query={query_idx}, {format} ===");
eprintln!(
"{}",
datafusion_physical_plan::display::DisplayableExecutionPlan::with_metrics(
plan.as_ref()
)
.indent(true)
);
}
let metric_sets = VortexMetricsFinder::find_all(plan.as_ref());
if metric_sets.is_empty() {
continue;
Expand Down
2 changes: 1 addition & 1 deletion docs/concepts/file-format.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ The writer accepts a stream of Vortex arrays, applies a layout strategy to organ
and serializes the layout and its segments into a single file.

The bulk of the file format specification describes the representation of the footer bytes such that the
layout tree can be reconstructed for scans.
layout tree can be reconstructed and expanded into scan plans.

See the [Vortex File Format Specification](../specs/file-format.md) for full details.

Expand Down
Loading
Loading