# PatchedArray #27

Merged pull request by a10y.
- Start Date: 2026-03-02
- Tracking Issue: TBD
- Draft PR: https://github.com/vortex-data/vortex/pull/6815
## Summary
Make a backwards-compatible change to the serialization format for `Patches` used by the FastLanes-derived encodings:

- BitPacked
- ALP
- ALP-RD

enabling fully data-parallel patch application inside the CUDA bit-unpacking kernels, without impacting CPU performance.

This relies on introducing a new encoding to represent exception patching, which would be a forward-compatibility break, as is always the case when adding a new default encoding.
---

## Data Layout
Patches have a new layout, influenced by the [G-ALP paper](https://ir.cwi.nl/pub/35205/35205.pdf) from CWI.

The key insight of the paper is that instead of holding the patches sorted by their global offset, we:

- Group patches into 1024-element chunks
- Further group the patches within each chunk by their "lanes", where a patch's lane is whichever lane of the underlying operation being patched it aligns to

For example, let's say that we have an array of 5,000 elements, with 32 lanes.

- We'd have $\left\lceil\frac{5,000}{1024}\right\rceil = 5$ chunks, and each chunk has 32 lanes. Each lane can have up to $1024 / 32 = 32$ patch values.
- Indices and values are aligned. Indices are positions within a chunk, so they can be stored as u16. Values are whatever the underlying value type is.
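To make the chunk/lane arithmetic concrete, here is a minimal Rust sketch (not the actual Vortex code) that maps a global element index to its chunk, lane, and intra-chunk index. The `intra % n_lanes` lane assignment is an assumption for illustration; the real assignment follows the FastLanes transposed order.

```rust
/// Hypothetical helper: map a global element index to (chunk, lane, intra-chunk index).
/// ASSUMPTION: `lane = intra % n_lanes` is illustrative only; the actual lane
/// assignment follows the FastLanes transposed layout and may differ.
const CHUNK_SIZE: usize = 1024;

fn locate(global_idx: usize, n_lanes: usize) -> (usize, usize, u16) {
    let chunk = global_idx / CHUNK_SIZE;
    let intra = global_idx % CHUNK_SIZE; // always < 1024, so it fits in a u16
    let lane = intra % n_lanes;
    (chunk, lane, intra as u16)
}
```

Under this toy mapping, element 2049 of the 5,000-element example falls in chunk 2 with intra-chunk index 1.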
```text
                chunk 0      chunk 0      chunk 0      chunk 0      chunk 0      chunk 0
                lane 0       lane 1       lane 2       lane 3       lane 4       lane 5
             ┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
lane_offsets │     0      │     0      │     2      │     2      │     3      │     5      │ ...
             └─────┬──────┴─────┬──────┴─────┬──────┴──────┬─────┴──────┬─────┴──────┬─────┘
                   │            │            │             │            │            │
                   │            │            │             │            │            │
             ┌─────┴────────────┘            └──────┬──────┘     ┌──────┘            └─────┐
             │                                      │            │                         │
             │                                      │            │                         │
             │                                      │            │                         │
             ▼────────────┬────────────┬────────────▼────────────▼────────────┬────────────▼
     indices │            │            │            │            │            │            │
             │            │            │            │            │            │            │
             ├────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
      values │            │            │            │            │            │            │
             │            │            │            │            │            │            │
             └────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
```
This layout has a few benefits:

- For GPU operations, each warp handles a single chunk, and each thread handles a single lane. Through the `lane_offsets`, each thread of execution has quick random access to an iterator of its patch values.
- Patches can be trivially sliced to a specific chunk range simply by slicing into the `lane_offsets`.
- Bulk operations can be executed efficiently per-chunk by loading all patches for a chunk and applying them in a loop, as before.
- Point lookups are still efficient: convert the target index into the chunk/lane, then do a linear scan for the index. There will be at most `1024 / N_LANES` patches per lane, which in our current implementation is 64. A linear search with loop unrolling should execute extremely fast on hardware with SIMD registers.
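The point-lookup path can be sketched as follows, assuming `lane_offsets` is a flat `u32` array holding one cumulative entry per (chunk, lane) in row-major order plus a final terminating offset. The function name and signature are illustrative, not the actual Vortex API.

```rust
/// Hypothetical point lookup over the patch buffers described above.
/// ASSUMPTION: `lane_offsets` holds `n_chunks * n_lanes + 1` cumulative u32 offsets.
fn lookup_patch<T: Copy>(
    lane_offsets: &[u32],
    indices: &[u16],
    values: &[T],
    n_lanes: usize,
    chunk: usize,
    lane: usize,
    intra_idx: u16,
) -> Option<T> {
    let slot = chunk * n_lanes + lane;
    let start = lane_offsets[slot] as usize;
    let end = lane_offsets[slot + 1] as usize;
    // At most 1024 / n_lanes entries: a short linear scan the compiler can unroll,
    // matching the "linear search with loop unrolling" described above.
    indices[start..end]
        .iter()
        .position(|&i| i == intra_idx)
        .map(|pos| values[start + pos])
}
```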
---

## Array Structure
```rust
/// An array that partially "patches" another array with new values.
#[derive(Debug, Clone)]
pub struct PatchedArray {
    /// The inner array that is being patched. This is the zeroth child.
    pub(super) inner: ArrayRef,

    /// Number of 1024-element chunks. Pre-computed for convenience.
    pub(super) n_chunks: usize,

    /// Number of lanes the patch indices and values have been split into. Each of the
    /// `n_chunks` chunks of 1024 values is split into `n_lanes` lanes horizontally,
    /// each lane having `1024 / n_lanes` values that might be patched.
    pub(super) n_lanes: usize,

    /// Offset into the first chunk.
    pub(super) offset: usize,
    /// Total length.
    pub(super) len: usize,

    /// Lane offsets. The PType of these MUST be u32.
    pub(super) lane_offsets: BufferHandle,
    /// Indices within a 1024-element chunk. The PType of these MUST be u16.
    pub(super) indices: BufferHandle,
    /// Patch values corresponding to the indices. The PType is specified by `values_ptype`.
    pub(super) values: BufferHandle,
    /// PType of the scalars in `values`. Can be any native type.
    pub(super) values_ptype: PType,

    pub(super) stats_set: ArrayStats,
}
```
The `PatchedArray` holds buffer handles for the `lane_offsets`, which provide chunk/lane-level random indexing into the patch `indices` and `values`, so these buffers can live equivalently in device or host memory.

The only operation performed at planning time is slicing, which means that all of its reduce rules run without issue either in CUDA or on the CPU.

---
# Operations
## Slicing

We look at the slice indices and align them to chunk boundaries, slicing both the child and the patches to those boundaries, and we preserve the offset + len to apply the final intra-chunk slice at execution time.
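The chunk-alignment arithmetic can be sketched as follows (a hypothetical helper under the 1024-element chunk size, not the actual implementation):

```rust
const CHUNK_SIZE: usize = 1024;

/// Hypothetical planning-time slice arithmetic: for a requested element range
/// [start, stop), compute the chunk range to slice the child and patches to,
/// plus the intra-chunk `offset` and `len` applied at execution time.
fn plan_slice(start: usize, stop: usize) -> (usize, usize, usize, usize) {
    let first_chunk = start / CHUNK_SIZE;
    let chunk_end = stop.div_ceil(CHUNK_SIZE); // exclusive chunk bound, rounded up
    let offset = start % CHUNK_SIZE; // residual offset into the first kept chunk
    let len = stop - start;
    (first_chunk, chunk_end, offset, len)
}
```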
## Filter / Take Execution

Filter / Take operations can arbitrarily break and reconstruct chunks, so they cannot be done metadata-only and thus must be a Kernel rather than a Reduce rule.

In practice, we perform the operation by:

- Executing the filter on the child, then executing the result
- Intersecting the filter with our patches, ideally a chunk at a time so we can write a vectorized version
- Applying the filtered patches over the executed child
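The last step can be sketched as a per-chunk scatter of the surviving patches over the executed child's buffer (illustrative only; the name and signature are assumptions):

```rust
/// Hypothetical final step: apply the (already filtered) patches for one chunk
/// over the executed child's output buffer.
fn apply_chunk_patches<T: Copy>(out: &mut [T], indices: &[u16], values: &[T], chunk_start: usize) {
    debug_assert_eq!(indices.len(), values.len());
    for (&i, &v) in indices.iter().zip(values.iter()) {
        // `i` is an intra-chunk index, so offset it by the chunk's global start.
        out[chunk_start + i as usize] = v;
    }
}
```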
## ScalarFns

We do not reduce any ScalarFns through the operation; instead, they run only at execution time.

This matches the current behavior of `BitPackedArray`s.
---

## Compatibility
`BitPackedArray` and `ALPArray` both hold a `Patches` internally, which we'd like to replace by wrapping them in a `PatchedArray`.

To do this without breaking backward compatibility, we modify the `VTable::build` function to return `ArrayRef`. This makes it easy to do encoding migrations on read in the future. The alternative is adding a new `BitPackedArray` and `ALPArray` that get migrated to on write.

This requires executing the Patches at read time. From scanning a handful of our tables, this is unlikely to cause any issues, as patches are generally not compressed. We only apply constant compression for patch values, and I would expect that to be rare in practice.
## Drawbacks

This will be a forward-compatibility break: old clients will not be able to read files written with the new encoding, and the potential break surface is huge given how ubiquitous bitpacked arrays and patches are in our encoding trees. This will cause friction, as users of Vortex with separate writer/reader pipelines will need to upgrade their Vortex clients on both sides in lockstep.

> Does this add complexity that could be avoided?

In my opinion, this centralizes complexity that was previously spread across multiple encodings.
## Alternatives

> Transpose the patches within GPU execution

This was found to perform poorly. The time spent on the D2H copy, patch transposition, and H2D copy far exceeded the cost of executing the bit-unpacking kernel, which puts a serious limit on our GPU scan performance. Combined with how ubiquitous `BitPackedArray`s with patches are in our encoding trees, this would be a permanent bottleneck on throughput.

> What is the cost of **not** doing this?

Our GPU scan performance would be permanently limited by patching overhead, which in TPC-H lineitem scans was shown to be the biggest bottleneck after string decoding.

> Is there a simpler approach that gets us most of the way there?

I don't think so.
## Prior Art

The original FastLanes GPU paper did not attempt to implement data-parallel patching within the FastLanes unpacking kernels.

The G-ALP paper was published later, and implemented patching for ALP values _after_ unpacking.

We use a data layout that closely matches the one described in _G-ALP_ and apply it to bit-unpacking as well.
## Unresolved Questions

- What parts of the design need to be resolved during the RFC process?
- What is explicitly out of scope for this RFC?
- Are there open questions that can be deferred to implementation?

## Future Possibilities

What natural extensions or follow-on work does this enable? This is a good place to note related ideas that are out of scope for this RFC but worth capturing.
---

## Review Discussion

> Do we want these to be uncompressed and never compressed in the future?

> If we assume that patches are only 0.5-1% of the overall array, then I think compression is sort of superfluous, yeah.

> I disagree strongly here: you can always write decompressed arrays, but you will find it much harder to go the other way.

> That's valid; maybe I'm underestimating how likely this is to change in the future. Let me update the PR to hold child values and see how that works out.