Add FlatLayout range read for sub-segment IO #6974

Open
jiaqizho wants to merge 1 commit into vortex-data:develop from jiaqizho:support-rangeread

Conversation


@jiaqizho jiaqizho commented Mar 16, 2026

Summary

When a FlatLayout has its array_tree metadata inlined in the footer, we can figure out exactly which bytes of the segment are needed for a given row range without any IO. This lets us issue a single small read instead of fetching the entire segment, which is a big win for point lookups and narrow scans on wide tables.

The range read planner walks the encoding tree (Primitive, Bool, BitPacked, Delta, FoR, ZigZag, ALP, ALPRD, Dict, FixedSizeList, Constant) and computes the minimal contiguous byte range covering the needed buffers. If that range is less than 50% of the full segment, we issue the targeted read; otherwise we fall back to reading the whole segment.
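As a rough sketch, the planner's final fallback decision might look like the following. The names `ReadPlan` and `choose_plan` are illustrative assumptions, not the actual vortex API:

```rust
// Hypothetical sketch of the planner's fallback threshold; names are
// illustrative, not the actual vortex API.

#[derive(Debug, PartialEq)]
enum ReadPlan {
    /// Targeted sub-segment read covering only the needed bytes.
    Range { start: u64, end: u64 },
    /// Fall back to fetching the entire segment.
    FullSegment,
}

/// Issue the targeted read only when the covered range is under 50% of
/// the segment; beyond that, one full read is usually cheaper overall.
fn choose_plan(start: u64, end: u64, segment_len: u64) -> ReadPlan {
    if (end - start) * 2 < segment_len {
        ReadPlan::Range { start, end }
    } else {
        ReadPlan::FullSegment
    }
}
```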

To make Delta work with sub-ranged buffers, Delta::build() now derives child array lengths from len + offset instead of metadata.deltas_len. On disk, offset is always 0 so this is a no-op for the normal decode path, but it lets the range read pass a smaller decode_len without the decoder panicking on buffer size mismatch.
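A minimal sketch of that length derivation, under the simplifying assumptions of a 1024-element fastlanes chunk and one base value per chunk; the function names are illustrative and the real `Delta::build()` signature differs:

```rust
// Illustrative only: derive Delta child lengths from `len + offset`
// rather than stored metadata, rounding up to whole 1024-element chunks
// so partially-needed chunks remain decodable. Chunk size and the
// one-base-per-chunk model are assumptions for this sketch.

const CHUNK: usize = 1024; // assumed fastlanes delta chunk size

/// Number of delta elements backing `len` logical rows at `offset`.
fn deltas_len(len: usize, offset: usize) -> usize {
    (len + offset).div_ceil(CHUNK) * CHUNK
}

/// Assumed one base value per chunk.
fn bases_len(len: usize, offset: usize) -> usize {
    (len + offset).div_ceil(CHUNK)
}
```

With `offset == 0` these reduce to the padded on-disk lengths, which is why the normal decode path is unchanged.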

Also adds request_range() to the SegmentSource trait with a default fallback implementation, efficient overrides in FileSegmentSource and BufferSegmentSource, a RangeReadEnabled session flag, and pub const NAME on all encoding structs for pattern matching in the planner.
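The default-plus-override pattern for `request_range()` can be sketched like this; the real `SegmentSource` trait is async and operates on segment identifiers and shared buffers, so treat these signatures as simplified assumptions:

```rust
// Simplified sketch of a trait-level fallback for ranged reads; the
// actual vortex SegmentSource trait differs in shape.
use std::ops::Range;

trait SegmentSource {
    /// Fetch the whole segment.
    fn request(&self) -> Vec<u8>;

    /// Fetch only `range`. The default fetches everything and slices,
    /// so every source stays correct; efficient sources override this
    /// with a genuinely ranged read.
    fn request_range(&self, range: Range<usize>) -> Vec<u8> {
        self.request()[range].to_vec()
    }
}

struct BufferSource(Vec<u8>);

impl SegmentSource for BufferSource {
    fn request(&self) -> Vec<u8> {
        self.0.clone()
    }

    // Override: slice in place instead of cloning the full buffer first.
    fn request_range(&self, range: Range<usize>) -> Vec<u8> {
        self.0[range].to_vec()
    }
}
```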

The current implementation requires the array encoding tree (ArrayNode) to be inlined in the footer via FLAT_LAYOUT_INLINE_ARRAY_NODE=1. Without this flag, the ArrayNode is stored inside the segment data and is not available to the range read planner until the entire segment is fetched (it would be possible to add an extra IO per column to fetch just the ArrayNode from the segment, but the overhead would negate much of the benefit). Since the planner needs the encoding tree to determine which byte ranges to read, range read is effectively disabled without inlining, and every take falls back to reading the full segment. A follow-up change will make inlining the default behavior.

Testing

@jiaqizho
Author

jiaqizho commented Mar 16, 2026

Note: If a segment contains a validity bitmap, it falls back to reading the entire segment.

Encodings with range read support

| Encoding | Description | Read Amplification Reduction |
| --- | --- | --- |
| Primitive | Reads sizeof(type) bytes | ~130,000x to ~1,000,000x |
| BitPacked | Reads one 1024-element chunk | ~250x to ~8,000x |
| FoR | Delegates to child (typically BitPacked) | ~250x to ~8,000x |
| ALP / ALPRD | Encodes to integers, delegates to child | ~250x to ~8,000x |
| Delta | Reads chunk-aligned deltas + corresponding bases | ~250x to ~8,000x |
| Bool | Reads byte-aligned range (1 byte) | ~1,000,000x |
| Dict | Reads code sub-range + full dictionary | ~250x to ~8,000x |
| Constant | Value already in metadata | N/A (zero IO) |
| ZigZag | Delegates to child | ~250x to ~8,000x |
| FixedSizeList | Delegates to child | Depends on child encoding |
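As a back-of-envelope check on the Primitive numbers above (illustrative arithmetic, not a benchmark result):

```rust
// Read amplification reduction = bytes of a full-segment read divided by
// bytes of the targeted read. For a point lookup of one f64 in a
// 1,000,000-row primitive segment, the targeted read is 8 bytes while a
// full read is 8,000,000 bytes, i.e. a ~1,000,000x reduction.
fn amplification_reduction(rows: u64, elem_size: u64, read_bytes: u64) -> u64 {
    (rows * elem_size) / read_bytes
}
```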

Encodings without range read support

| Encoding | Reason |
| --- | --- |
| Sparse | Cannot determine sub-range without reading indices first |
| BitPacked + patches | Patch indices are global coordinates, incompatible with sub-ranged data |
| Delta (offset ≠ 0) | Non-zero offset breaks chunk alignment |
| RunEnd | Variable-length runs, cannot map row number to byte offset |
| RLE (fastlanes) | Same as RunEnd, variable-length runs |
| VarBin | Variable-length data, cannot map row to byte offset |
| VarBinView | Variable-length data, cannot map row to byte offset |
| FSST | Compressed variable-length, requires decompression to locate |
| PCO | Opaque compressed blocks, no random access |
| ByteBool | Could be supported but not implemented (low priority, Bool is more common) |
| DateTimeParts | Multi-child container, requires recursive support for all children |
| DecimalByteParts | Multi-child container, requires recursive support for all children |
| Sequence | Container encoding, not a leaf node |
| Zstd / ZstdBuffers | Opaque compressed blocks, no random access |

Contributor

@joseph-isaacs joseph-isaacs left a comment


I really like the idea of sub-segment reads; however, this must be added in an extensible way. @gatesn, any thoughts on where we should put this?

It seems to me that we want to allow arrays to specify how to slice their buffers.

@joseph-isaacs
Contributor

joseph-isaacs commented Mar 16, 2026

I think this should be moved to a design discussion. Please open one and then we can refine the design before implementing this feature.

@gatesn
Contributor

gatesn commented Mar 16, 2026

A couple of immediate thoughts:

  1. This is definitely a problem we want to address.
  2. A lot of the nastiness here is because we don't yet have a ListLayout (which means we cannot chunk the elements array independently of the offsets array).
  3. We should change FixedSizeList to be a "view" type, possibly just ListView so that we can slice and shuffle without copying elements. (Somewhat unrelated I think)
  4. It's not widely documented, but if you're doing this from local disk, you may get sufficient pruning from memmap'ing the file, in which case due to Vortex buffer alignment you will get sub-segment slicing out of the box.
  5. As Joe says.... I would love for all arrays to be able to push-down slicing/selection into their I/O layer. I think we might be able to do something here with our BufferHandle abstraction that means arrays cannot assume the buffer is contiguous host memory. This might allow us to push down selection into the buffer, and then call to_host later to materialize it into something useful.

@connortsui20
Contributor

just as an aside, I do like the const NAME change. Is it possible to split that out into a separate PR?

@codspeed-hq

codspeed-hq bot commented Mar 16, 2026

Merging this PR will improve performance by 11.12%

⚡ 1 improved benchmark
✅ 1008 untouched benchmarks
⏩ 1515 skipped benchmarks¹

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | binary_search_std | 582.8 ns | 524.4 ns | +11.12% |

Comparing jiaqizho:support-rangeread (3d68adc) with develop (876813b)


Footnotes

  1. 1515 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@jiaqizho
Author

jiaqizho commented Mar 17, 2026

@joseph-isaacs @gatesn @connortsui20

Thanks for the thorough review and great suggestions! I've force-pushed an update addressing the feedback.

I have replaced the static match encoding_id dispatch with a VTable::plan_range_read method. Each encoding now declares its own buffer sub-ranges and child recursion strategy via the vtable without centralized match on encoding names. Unsupported encodings return None by default and fall back to full segment reads.
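The shape of such a vtable hook might look like the following; this is a hedged sketch with illustrative names, not the actual vortex `VTable` definition:

```rust
// Illustrative shape of a per-encoding range planner: each encoding may
// map a row range to byte sub-ranges, with `None` meaning "unsupported,
// read the full segment". Not the actual vortex VTable.
use std::ops::Range;

trait EncodingVTable {
    fn plan_range_read(&self, _rows: Range<usize>) -> Option<Vec<Range<usize>>> {
        None // default: fall back to a full segment read
    }
}

struct Primitive {
    elem_size: usize,
}

impl EncodingVTable for Primitive {
    fn plan_range_read(&self, rows: Range<usize>) -> Option<Vec<Range<usize>>> {
        // Fixed-width values map rows to bytes directly.
        Some(vec![rows.start * self.elem_size..rows.end * self.elem_size])
    }
}

struct RunEnd; // variable-length runs: keeps the `None` default

impl EncodingVTable for RunEnd {}
```

This keeps the dispatch extensible: adding range read support to a new encoding means implementing one method, with no centralized match to update.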

Design discussion: Opened #6991 with the motivation, full design, benchmark data (from my setup), and future directions (nullable support, BufferHandle-based approach, etc.).

Would love any further feedback on the design or implementation — happy to keep iterating!

@joseph-isaacs joseph-isaacs added the action/benchmark Trigger full benchmarks to run on this PR label Mar 17, 2026
@connortsui20 connortsui20 added changelog/feature A new feature and removed action/benchmark Trigger full benchmarks to run on this PR labels Mar 17, 2026
@connortsui20
Contributor

@jiaqizho If you look at the CI actions, you'll see that this change doesn't pass a few checks (formatting and docs). Let us know when you've fixed it and we can run CI for you!

@jiaqizho jiaqizho force-pushed the support-rangeread branch from 533ce6a to 458273c on March 17, 2026 at 12:08
@jiaqizho
Author

jiaqizho commented Mar 17, 2026

@jiaqizho If you look at the CI actions, you'll see that this change doesn't pass a few checks (formatting and docs). Let us know when you've fixed it and we can run CI for you!

@connortsui20 Thanks for the heads up! I've pushed fixes for the formatting and public API lock files. These were generated locally, so I'm not 100% sure they match your CI environment — please let me know if anything still fails.

@connortsui20
Contributor

Seems like https://github.com/vortex-data/vortex/actions/runs/23193429228/job/67395598365?pr=6974 is still failing, just need to fix the doc links

When a FlatLayout has its array_tree metadata inlined in the footer, we can
figure out exactly which bytes of the segment are needed for a given row range
without any IO. This lets us issue a single small read instead of fetching the
entire segment, which is a big win for point lookups and narrow scans on wide
tables.

The range read planner walks the encoding tree via `VTable::plan_range_read`,
where each encoding (Primitive, Bool, BitPacked, Delta, FoR, ZigZag, ALP, ALPRD,
Dict, FixedSizeList, Constant, Null, Sequence, ByteBool, DateTimeParts,
DecimalByteParts) declares its own buffer sub-ranges and child recursion strategy.
If the resulting byte range is less than 50% of the full segment, we issue the
targeted read; otherwise we fall back to reading the whole segment.

To make Delta work with sub-ranged buffers, Delta::build() now derives child
array lengths from `len + offset` instead of metadata.deltas_len. On disk,
offset is always 0 so this is a no-op for the normal decode path, but it lets
the range read pass a smaller decode_len without the decoder panicking on buffer
size mismatch.

Also adds `request_range()` to the SegmentSource trait with a default fallback
implementation, efficient overrides in FileSegmentSource and BufferSegmentSource,
a `RangeReadEnabled` session flag, and `ScanBuilder::with_split_row_indices` to
generate per-index tight ranges for point lookups.

Signed-off-by: jiaqizho <jiaqi.zhou@zilliz.com>
@jiaqizho jiaqizho force-pushed the support-rangeread branch from 458273c to 3d68adc on March 17, 2026 at 13:26
@jiaqizho
Author

Seems like https://github.com/vortex-data/vortex/actions/runs/23193429228/job/67395598365?pr=6974 is still failing, just need to fix the doc links

@connortsui20 Done, please retrigger. Thanks!

@jiaqizho
Author

@joseph-isaacs @gatesn @connortsui20

Hi, wanted to follow up on this. The discussion in #6991 has the full design and S3 benchmark data. The key question is whether the vtable-based plan_range_read approach addresses the extensibility concern from the original PR review. Let me know if you'd prefer a different direction — happy to iterate.

@connortsui20
Contributor

See #6991 (comment) for more discussion
