12 changes: 11 additions & 1 deletion index.ts
```diff
@@ -442,7 +442,7 @@ async function getHighlighter(): Promise<Highlighter> {
   if (!highlighter) {
     highlighter = await createHighlighter({
       themes: ["github-light", "github-dark"],
-      langs: ["rust", "python", "markdown"],
+      langs: ["rust", "python", "markdown", "cpp", "c"],
     });
   }
   return highlighter;
```
```diff
@@ -554,6 +554,16 @@ async function build(liveReload: boolean = false): Promise<number> {
     await Bun.write("dist/vortex_logo.svg", await logo.text());
   }

+  // Copy all static assets to dist/static/
+  await $`mkdir -p dist/static`.quiet();
+  const staticGlob = new Bun.Glob("*");
+  for await (const filename of staticGlob.scan("./static")) {
+    const src = Bun.file(`static/${filename}`);
+    const dest = `dist/static/${filename}`;
+    await Bun.write(dest, src);
+    console.log(`Copied static/${filename} -> ${dest}`);
+  }
+
   // Generate index page
   const indexHTML = indexPage(rfcs, repoUrl, liveReload);
   await Bun.write("dist/index.html", indexHTML);
```
186 changes: 186 additions & 0 deletions proposed/0027-patches-format.md
@@ -0,0 +1,186 @@
- Start Date: 2026-03-02
- Tracking Issue: TBD
- Draft PR: https://github.com/vortex-data/vortex/pull/6815

## Summary

Make a backwards-compatible change to the serialization format for `Patches` used by the FastLanes-derived encodings:

- BitPacked
- ALP
- ALP-RD

> **Review comment:** Why limit to FastLanes encodings? What about sparse arrays?
>
> **Author reply:** Mostly just because the whole "lanes" concept only maps cleanly to primitives. I suppose this could help us write a data-parallel version of SparseArray, though...

enabling fully data-parallel patch application inside of the CUDA bit-unpacking kernels, while not impacting
CPU performance.

This relies on introducing a new encoding to represent exception patching, which would be a forward-compatibility break
as is always the case when adding a new default encoding.
> **Comment on lines +16 to +17 (@a10y, Mar 6, 2026):**
>
> > as is always the case when adding a new default encoding
>
> This is not the purpose of this RFC, but just calling out that this is going to continue to be annoying. I see a few alternatives here:
>
> 1. All new encodings need to be gated behind a writer flag so they are not written unless you explicitly opt in. Then after some number of releases they can be enabled by default.
> 2. Come back around to the idea of distributing encodings as WASM binaries; seems unlikely to be picked up very widely.
> 3. NEVER allow new encodings within a single "edition". We'd need to formalize what an edition means, how frequently we drop one, and how we maintain and test encodings on develop between edition releases.


---

## Data Layout

Patches have a new layout, influenced by the [G-ALP paper](https://ir.cwi.nl/pub/35205/35205.pdf) from CWI.

The key insight of the paper is that, instead of holding the patches sorted by their global offset, we:

- Group patches into 1024-element chunks
- Further group the patches within each chunk by their "lanes", where a patch's lane is whichever lane of the underlying operation being patched its position aligns to

For example, let's say that we have an array of 5,000 elements, with 32 lanes.

- We'd have $\left\lceil\frac{5,000}{1024}\right\rceil = 5$ chunks, each chunk has 32 lanes. Each lane can have up to 32 patch values
- Indices and values are aligned. Indices are indices within a chunk, so they can be stored as u16. Values are whatever the underlying values type is.
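The chunk/lane arithmetic above can be sketched as follows. This is a hypothetical helper, not the Vortex API, and it assumes 1024-element chunks with an interleaved lane assignment (`lane = in_chunk % n_lanes`); the exact FastLanes lane mapping may differ.

```rust
const CHUNK_SIZE: usize = 1024;

/// Map a global element index to its (chunk, lane, in-chunk index).
fn locate(global_idx: usize, n_lanes: usize) -> (usize, usize, u16) {
    let chunk = global_idx / CHUNK_SIZE;
    let in_chunk = global_idx % CHUNK_SIZE; // always < 1024, so it fits in a u16
    let lane = in_chunk % n_lanes; // assumed interleaved lane assignment
    (chunk, lane, in_chunk as u16)
}

fn main() {
    // 5,000 elements: ceil(5000 / 1024) = 5 chunks.
    assert_eq!(5_000usize.div_ceil(CHUNK_SIZE), 5);
    // Element 2,050 lands in chunk 2 at in-chunk index 2, i.e. lane 2 of 32.
    assert_eq!(locate(2_050, 32), (2, 2, 2));
}
```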

```text

chunk 0 chunk 0 chunk 0 chunk 0 chunk 0 chunk 0
lane 0 lane 1 lane 2 lane 3 lane 4 lane 5
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
lane_offsets │ 0 │ 0 │ 2 │ 2 │ 3 │ 5 │ ...
└─────┬──────┴─────┬──────┴─────┬──────┴──────┬─────┴──────┬─────┴──────┬─────┘
│ │ │ │ │ │
│ │ │ │ │ │
┌─────┴────────────┘ └──────┬──────┘ ┌──────┘ └─────┐
│ │ │ │
│ │ │ │
│ │ │ │
▼────────────┬────────────┬────────────▼────────────▼────────────┬────────────▼
indices │ │ │ │ │ │ │
│ │ │ │ │ │ │
├────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
values │ │ │ │ │ │ │
│ │ │ │ │ │ │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
```

This layout has a few benefits:

- For GPU operations, each warp handles a single chunk, and each thread handles a single lane. Through the `lane_offsets`, each thread of execution can have quick random access to an iterator of values
- Patches can be trivially sliced to a specific chunk range simply by slicing into the `lane_offsets`
- Bulk operations can be executed efficiently per-chunk by loading all patches for a chunk and applying them in a loop, as before
- Point lookups are still efficient. Convert the target index into the chunk/lane, then do a linear scan for the index. There will be at most `1024 / N_LANES` patches, which in our current implementation is 64. A linear search with loop unrolling should be able to execute this extremely fast on hardware with SIMD registers.
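The point-lookup path above can be illustrated with a hedged sketch over hypothetical flat slices rather than the real `BufferHandle` buffers, again assuming the interleaved `in_chunk % n_lanes` lane assignment:

```rust
/// Look up the patch value for `global_idx`, if one exists.
///
/// `lane_offsets` holds one start offset per (chunk, lane) pair plus a final
/// end offset; `indices` and `values` are aligned and grouped by (chunk, lane).
fn lookup_patch(
    lane_offsets: &[u32],
    indices: &[u16],
    values: &[i64],
    n_lanes: usize,
    global_idx: usize,
) -> Option<i64> {
    let chunk = global_idx / 1024;
    let in_chunk = (global_idx % 1024) as u16;
    let lane = (in_chunk as usize) % n_lanes; // assumed lane assignment
    let slot = chunk * n_lanes + lane;
    let (start, end) = (lane_offsets[slot] as usize, lane_offsets[slot + 1] as usize);
    // Linear scan over at most 1024 / n_lanes candidates; easy to unroll or vectorize.
    indices[start..end]
        .iter()
        .position(|&i| i == in_chunk)
        .map(|pos| values[start + pos])
}

fn main() {
    // One chunk, two lanes: lane 0 patches in-chunk index 0, lane 1 patches index 1.
    let (offsets, idx, vals) = (vec![0u32, 1, 2], vec![0u16, 1], vec![10i64, 20]);
    assert_eq!(lookup_patch(&offsets, &idx, &vals, 2, 1), Some(20));
    assert_eq!(lookup_patch(&offsets, &idx, &vals, 2, 2), None);
}
```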

---

## Array Structure

```rust
/// An array that partially "patches" another array with new values.
///
/// Patched arrays implement the set of nodes that do this instead here...I think?
#[derive(Debug, Clone)]
pub struct PatchedArray {
    /// The inner array that is being patched. This is the zeroth child.
    pub(super) inner: ArrayRef,

    /// Number of 1024-element chunks. Pre-computed for convenience.
    pub(super) n_chunks: usize,

    /// Number of lanes the patch indices and values have been split into. Each of the `n_chunks`
    /// of 1024 values is split into `n_lanes` lanes horizontally, each lane having 1024 / n_lanes
    /// values that might be patched.
    pub(super) n_lanes: usize,

    /// Offset into the first chunk.
    pub(super) offset: usize,
    /// Total length.
    pub(super) len: usize,

    /// Lane offsets. The PType of these MUST be u32.
    pub(super) lane_offsets: BufferHandle,
    /// Indices within a 1024-element chunk. The PType of these MUST be u16.
    pub(super) indices: BufferHandle,
    /// Patch values corresponding to the indices. The ptype is specified by `values_ptype`.
    pub(super) values: BufferHandle,

    /// PType of the scalars in `values`. Can be any native type.
    pub(super) values_ptype: PType,

    pub(super) stats_set: ArrayStats,
}
```

> **Comment on lines +86 to +88 (`offset` / `len`):** Are there size bounds on these?
>
> **Author reply:** Not on `len`, but `offset < 1024`. I just used `usize` for indexing convenience.

> **Comment on lines +93 to +95 (@joseph-isaacs, Mar 9, 2026):** Do we want these to be uncompressed and never compressed in the future?
>
> **Author reply:** If we assume that patches are only 0.5-1% of the overall array then I think compression is sort of superfluous, yeah.

The PatchedArray holds buffer handles for the `lane_offsets`, which provide chunk/lane-level random indexing
into the patch `indices` and `values`, so these values can live equivalently in device or host memory.

The only operation performed at planning time is slicing, which means that all of its reduce rules would run
without issue in CUDA or on CPU.

> **Review comment:** what will you do here?

---

## Operations

## Slicing

We look at the slice indices, align them to chunk boundaries, slice both the child and the patches at those boundaries, and preserve the offset + len to apply the final intra-chunk slice at execution time.
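The chunk-boundary alignment can be sketched as follows; `plan_slice` is a hypothetical helper, not the actual slice rule:

```rust
/// Align [start, stop) to 1024-element chunk boundaries, keeping the
/// intra-chunk offset and logical length to apply at execution time.
fn plan_slice(start: usize, stop: usize) -> (usize, usize, usize, usize) {
    let start_chunk = start / 1024; // first chunk we keep
    let stop_chunk = stop.div_ceil(1024); // one past the last chunk we keep
    let offset = start % 1024; // intra-chunk offset into the first kept chunk
    let len = stop - start; // final logical length
    (start_chunk, stop_chunk, offset, len)
}

fn main() {
    // Slicing [1500, 3000) keeps chunks 1..3, with offset 476 into chunk 1.
    assert_eq!(plan_slice(1_500, 3_000), (1, 3, 476, 1_500));
}
```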

## Filter / Take Execution

Filter / Take operations can arbitrarily break and reconstruct new chunks, so they cannot be done metadata-only and thus must be a Kernel rather than a Reduce rule.

In practice, we perform the operation by:

- Executing the filter on the child array
- Intersecting the filter with our patches, ideally a chunk at a time so we can write a vectorized version
- Applying the filtered patches over the executed child
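A scalar sketch of the patch/filter intersection step, using hypothetical flat patch positions rather than the chunked buffers (a real kernel would work chunk-at-a-time as noted above):

```rust
/// Keep only the patches whose positions survive `mask`, remapping each
/// survivor to its new position via a running count of selected elements.
fn filter_patch_positions(mask: &[bool], patch_positions: &[usize]) -> Vec<usize> {
    // new_pos[i] = number of selected elements strictly before i (exclusive prefix sum).
    let mut new_pos = vec![0usize; mask.len()];
    let mut count = 0;
    for (i, &keep) in mask.iter().enumerate() {
        new_pos[i] = count;
        count += keep as usize;
    }
    patch_positions
        .iter()
        .filter(|&&p| mask[p])
        .map(|&p| new_pos[p])
        .collect()
}

fn main() {
    // Elements 0, 2, 3 survive; the patch at 1 is dropped, the patch at 2 moves to 1.
    let kept = filter_patch_positions(&[true, false, true, true], &[1, 2]);
    assert_eq!(kept, vec![1]);
}
```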

## ScalarFns

We do not reduce any ScalarFns through the operation; instead, they only run at execution time.

This matches the current behavior of BitPackedArrays.

---

## Compatibility

BitPackedArray and ALPArray both hold a `Patches` internally, which we'd like to replace by wrapping them in a `PatchedArray`.

To do this without breaking backward compatibility, we modify the `VTable::build` function to return `ArrayRef`. This makes it easy to do encoding migrations on read in the future. The alternative is adding a new BitPackedArray and ALPArray that gets migrated to on write.

This requires executing the Patches at read time. From scanning a handful of our tables, this is unlikely to cause any issues as patches are generally not compressed. We only apply constant compression for patch values, and I would expect that to be rare in practice.

## Drawbacks

This will be a forward-compatibility break. Old clients will not be able to read files written with the new encoding.
However, the potential break surface is huge given how ubiquitous bitpacked arrays and patches are in our encoding trees.
This will cause friction as users of Vortex who have separate writer/reader pipelines will need to upgrade their Vortex
clients across both in lockstep.

> Does this add complexity that could be avoided?

IMO this centralizes some complexity that previously was shared across multiple encodings.

## Alternatives

> Transpose the patches within GPU execution

This was found to be not very performant. The time spent D2H copy, transpose patches, H2D copy far exceeded the cost of executing the bitpacking kernel, which puts a serious
limit on our GPU scan performance. Combined with how ubiquitous `BitPackedArray`s with patches are in our encoding trees, would be a permanent bottleneck on throughput.

> What is the cost of **not** doing this?

Our GPU scan performance would be permanently limited by patching overhead, which in TPC-H lineitem scans was shown to be the biggest bottleneck after string decoding.

> Is there a simpler approach that gets us most of the way there?

I don't think so.

## Prior Art

The original FastLanes GPU paper did not attempt to implement data-parallel patching within the FastLanes unpacking
kernels.

The G-ALP paper was published later on, and implemented patching for ALP values _after_ unpacking.

We use a data layout that closely matches the one described in _G-ALP_ and apply it to bit-unpacking as well.

## Unresolved Questions

- What parts of the design need to be resolved during the RFC process?
- What is explicitly out of scope for this RFC?
- Are there open questions that can be deferred to implementation?

## Future Possibilities

What natural extensions or follow-on work does this enable? This is a good place to note related ideas that are out of scope for this RFC but worth capturing.
Binary file added static/galp-fig1.png