Skip to content

Consolidate scan tuning params into one IcebergPerfConfig (by-value)#108

Open
hall-alex wants to merge 2 commits into
mainfrom
mah-ice-params-rusty
Open

Consolidate scan tuning params into one IcebergPerfConfig (by-value)#108
hall-alex wants to merge 2 commits into
mainfrom
mah-ice-params-rusty

Conversation

@hall-alex

@hall-alex hall-alex commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

RAI-50776

Consolidate all Iceberg scan tuning parameters into a single IcebergPerfConfig that is the one authoritative home for every default, passed by value into new_scan / new_incremental_scan at construction. Removes the per-knob with_*! setters, the Option/0 = auto hidden defaults, the dead data_file_concurrency_limit, and the hard-coded per-file buffer constants — a forgotten knob is now a compile error, not a silent wrong default.

⚠️ Breaking change (ABI + Julia API). Paired with raicode PR RelationalAI/raicode#27966. Must be published + merged in the order below.

🔀 How to merge (do these in order)

The dependency chain is raicode → RustyIceberg.jl (Julia, git-rev pinned) → iceberg_rust_ffi_jll (the Rust dylib). Because this is an ABI break, the JLL must be published first so this PR's test-julia CI can resolve iceberg_rust_ffi_jll = "0.9" and go green.

  1. Publish the new JLL first — open a Yggdrasil PR for iceberg_rust_ffi at v0.9.0 with the GitSource pointing at this branch's pushed commit (5493fe4). The Rust crate is fetched by commit SHA, so this does not require this PR to be merged — it can (and should) be done now. Pattern: Yggdrasil#13808 (Robert's 0.8.0), Yggdrasil#13839 (Gerald's bump).
  2. This PR's CI goes green once iceberg_rust_ffi_jll 0.9.x is registered (the [compat] here requires "0.9"). Until then, test-julia-* correctly fails with Unsatisfiable requirements … iceberg_rust_ffi_jll. ✅ Existing consumers are unaffected meanwhile: raicode master pins RustyIceberg by an exact git-rev in its Manifest.toml lockfile, so it keeps resolving the old 0.8.3 + old iceberg_rust_ffi_jll 0.8.x.
  3. Merge this PR, then trigger the RustyIceberg.jl 0.9.0 release (open the merge commit and comment to kick off the registry/TagBot flow — see this repo's README → Release). Version is already bumped to 0.9.0 here.
  4. Then the raicode side (RelationalAI/raicode#27966) can be merged: it re-resolves its Manifest.toml to this new rev + iceberg_rust_ffi_jll 0.9.x and starts using the new API. See that PR for its own steps.

Prior art for this exact 3-repo dance (RustyIceberg.jl → Yggdrasil → raicode): the Arrow C Data Interface breaking change — RustyIceberg.jl#100Yggdrasil#13808raicode#27727.

Summary

  • One config, owned here. IcebergPerfConfig (Julia, scan_common.jl) holds every tuning default with the full tuning docstring; raicode re-exports it and no longer defines its own. Knobs: batch_size, manifest_file_concurrency_limit, manifest_entry_concurrency_limit, file_prefetch_depth, serialization_concurrency_limit, plus the per-file output-buffer limits max_buffered_bytes_per_task, max_prefetch_buffers_of_waiting_file, max_prefetch_buffers_of_active_file.
  • By-value construction. new_scan(table, perf) / new_incremental_scan(table, from, to, perf) take the config and apply the manifest/batch knobs to the iceberg-rs builder at construction. Deleted: with_batch_size!, with_file_prefetch_depth!, with_serialization_concurrency_limit!, with_manifest_*_concurrency_limit!, and the with_data_file_concurrency_limit! no-ops (+ their exports).
  • Per-file buffer limits are now per-scan too (Robert's review): the former hard-coded MAX_BUFFERED_BYTES_PER_TASK / MAX_PREFETCH_BUFFERS_OF_{WAITING,ACTIVE}_FILE constants are gone; the values are carried per-scan in a BufferLimits struct on the scan and threaded through create_nested_pipeline / spawn_file_task to size the byte semaphore, the waiting/active slot budgets, and the per-file mpsc capacity. Defaults (100 MiB / 1 / 8) live only in the Julia IcebergPerfConfig.
  • Rust FFI: new #[repr(C)] IcebergPerfConfigFFI (8× u64, no Default) as the by-value wire struct; batch_size: Option<usize>usize; removed the dead file_concurrency field and every 0 = auto (cpu_count) fallback — values are used verbatim. Dead FFI setters/macros removed.
  • Env-var escape hatch. const rust_lib now honors ENV["ICEBERG_RUST_FFI_LIB"] (mirrors RustyObjectStore's OBJECT_STORE_LIB) so local/unreleased dylibs load in subprocesses such as ReTestItems workers, which run in a temp test env that does not inherit a LocalPreferences.toml override.
  • Versions: iceberg_rust_ffi crate + RustyIceberg.jl0.9.0; [compat] iceberg_rust_ffi_jll = "0.9".

Organized as two commits: the consolidation, then the per-file buffer-limit knobs (+ the cargo fmt/clippy fixes from the first CI run).

Test plan

  • cargo test — crate compiles; perf_config_abi_tests (struct size/align/offsets, 8× u64) passes; cargo fmt --check and cargo clippy --all-targets -- -D warnings clean.
  • test/perf_config_tests.jl — server-free binary-compat round-trip (isbits, sizeof == 8 × u64, iceberg_perf_config_roundtrip echo with 8 distinct sentinels) verifies the Julia ↔ Rust struct layout agree.
  • test/scan_tests.jl migrated to the new construction API.
  • Verified end-to-end from raicode (iceberg load unit suite, 131377/131377) against this branch via ENV["ICEBERG_RUST_FFI_LIB"].
  • test-julia-* CI — green once iceberg_rust_ffi_jll 0.9.x is published (step 1).

🤖 Generated with Claude Code

Replace the per-knob `with_*!` scan setters with a single `IcebergPerfConfig`
struct passed by value into `new_scan` / `new_incremental_scan` at construction.
This makes the config the single authoritative home for every tuning default and
turns a forgotten knob into a compile error rather than a runtime/silent-default
bug.

Julia side:
- Add `IcebergPerfConfig` (UInt64 fields, @kwdef defaults, full tuning docstring)
  in scan_common.jl; export it. This is now the one place defaults live.
- `new_scan(table, perf)` / `new_incremental_scan(table, from, to, perf)` take the
  config by value over the FFI; delete `with_batch_size!`,
  `with_file_prefetch_depth!`, `with_serialization_concurrency_limit!`,
  `with_manifest_*_concurrency_limit!`, and the `with_data_file_concurrency_limit!`
  no-ops (plus their exports).
- Add an `ENV["ICEBERG_RUST_FFI_LIB"]` escape hatch on `const rust_lib` (mirrors
  RustyObjectStore's `OBJECT_STORE_LIB`) so local/unreleased dylibs load in
  subprocesses such as ReTestItems workers, which run in a temp test env that does
  not inherit a `LocalPreferences.toml` override.

Rust FFI side:
- Add `#[repr(C)] IcebergPerfConfigFFI` (5x u64, no Default) as the by-value wire
  struct; `iceberg_new_scan` / `iceberg_new_incremental_scan` take it and apply the
  manifest/batch knobs to the iceberg-rs builder, storing the rest on the scan.
- Drop `batch_size: Option<usize>` -> `usize`, remove the dead `file_concurrency`
  field, and remove all `0 = auto` (cpu_count) fallbacks; values are now used
  verbatim from the config.
- Delete the dead `with_*` FFI setters/macros and the `data_file_concurrency_limit`
  no-ops.

Tests:
- Add perf_config_tests.jl: a server-free binary-compatibility check (isbits,
  sizeof, and an `iceberg_perf_config_roundtrip` FFI echo with distinct sentinels)
  plus a Rust `cargo test` asserting struct size/align/offsets.
- Migrate scan_tests.jl to the new construction API.

Versions: bump iceberg_rust_ffi crate and RustyIceberg.jl to 0.9.0 (ABI break),
and require `iceberg_rust_ffi_jll = "0.9"`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hall-alex hall-alex requested a review from robertbuessow June 25, 2026 09:30

@robertbuessow robertbuessow left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about MAX_BUFFERED_BYTES_PER_TASK and MAX_PREFETCH_BUFFERS_OF_WAITING_FILE

@hall-alex hall-alex closed this Jun 26, 2026
@hall-alex hall-alex reopened this Jun 26, 2026
@hall-alex

Copy link
Copy Markdown
Collaborator Author

How about MAX_BUFFERED_BYTES_PER_TASK and MAX_PREFETCH_BUFFERS_OF_WAITING_FILE

@robertbuessow, good idea! Having Claude add these do the struct as well.

Follow-up to Robert's review on #108: fold the three previously hard-coded
per-file output-buffer constants into the per-scan `IcebergPerfConfig` so they
join the other tuning knobs as caller-supplied, single-source-of-truth defaults.

- `max_buffered_bytes_per_task`, `max_prefetch_buffers_of_waiting_file`,
  `max_prefetch_buffers_of_active_file` are added to `IcebergPerfConfig` (Julia,
  defaults 100 MiB / 1 / 8) and to the `IcebergPerfConfigFFI` wire struct (now
  8 × u64).
- Rust: the `MAX_*` consts are deleted; the values are carried per-scan in a new
  `BufferLimits` struct on `IcebergScan` / `IcebergIncrementalScan` and threaded
  through `create_nested_pipeline` / `spawn_file_task` to size the byte semaphore,
  the waiting/active slot budgets, and the per-file mpsc capacity. The stats
  display reads the per-scan byte budget from `STATS` instead of the dead const.
- Also fixes the two CI lint failures on #108: `cargo fmt` and a `cargo clippy`
  `-D warnings` (empty-line-after-doc-comment + items-after-test-module in
  scan_common.rs).
- perf_config_tests.jl round-trip and the Rust ABI test extended to 8 fields;
  pipeline-unification.md updated (the "expose these as FFI-tunable knobs" future-
  work item is now done).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants