Skip to content

Updated Variant array and the new VariantGet expression#7877

Open
AdamGS wants to merge 25 commits into
developfrom
adamg/yet-another-variant-array
Open

Updated Variant array and the new VariantGet expression#7877
AdamGS wants to merge 25 commits into
developfrom
adamg/yet-another-variant-array

Conversation

@AdamGS
Copy link
Copy Markdown
Contributor

@AdamGS AdamGS commented May 11, 2026

Summary

This PR includes two big changes as Variant moves closer to readiness.

  1. Potentially holding the shredded child of a variant array in the canonical VariantArray
  2. A VariantGet expression that can pull extract data out of variant arrays, either in a typed way or as a more opaque Variant.

For reviewers, some relevant context might be:

  1. The VariantGet RFC: this RFC takes some lessons I've learned working on this into account and reflects my updated view of this problem.
  2. The original Variant type RFC

I think the Parquet spec is also a pretty good read and a very heavy influence of this work - Shredding and the Variant type.

@AdamGS AdamGS force-pushed the adamg/yet-another-variant-array branch from e6ed40d to 54d4015 Compare May 11, 2026 13:17
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 11, 2026

Merging this PR will improve performance by 21.12%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 32 improved benchmarks
✅ 1184 untouched benchmarks
⏩ 24 skipped benchmarks1

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation take_search[(0.005, 0.05)] 168.5 µs 132 µs +27.59%
Simulation take_search[(0.005, 0.1)] 320.4 µs 247.6 µs +29.42%
Simulation take_search[(0.005, 0.5)] 1.5 ms 1.2 ms +31.01%
Simulation take_search[(0.005, 1.0)] 3.1 ms 2.3 ms +31.24%
Simulation take_search[(0.01, 0.05)] 179.6 µs 143.1 µs +25.45%
Simulation take_search[(0.01, 0.1)] 341.3 µs 268.5 µs +27.13%
Simulation take_search[(0.01, 0.5)] 1.6 ms 1.3 ms +28.56%
Simulation take_search[(0.01, 1.0)] 3.3 ms 2.5 ms +28.76%
Simulation take_search[(0.1, 0.05)] 249.2 µs 212.8 µs +17.12%
Simulation take_search[(0.1, 0.1)] 458.8 µs 385.9 µs +18.88%
Simulation take_search[(0.1, 0.5)] 2.2 ms 1.8 ms +20.39%
Simulation take_search[(0.1, 1.0)] 4.3 ms 3.5 ms +20.62%
Simulation take_search_chunked[(0.005, 0.05)] 193.3 µs 162.3 µs +19.12%
Simulation take_search_chunked[(0.005, 0.1)] 369.3 µs 307.4 µs +20.16%
Simulation take_search_chunked[(0.005, 0.5)] 1.8 ms 1.5 ms +21.06%
Simulation take_search_chunked[(0.005, 1.0)] 3.5 ms 2.9 ms +21.19%
Simulation take_search_chunked[(0.01, 0.05)] 206.3 µs 175.4 µs +17.67%
Simulation take_search_chunked[(0.01, 0.1)] 394 µs 332.1 µs +18.65%
Simulation take_search_chunked[(0.01, 0.5)] 1.9 ms 1.6 ms +19.48%
Simulation take_search_chunked[(0.01, 1.0)] 3.8 ms 3.2 ms +19.59%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.


Comparing adamg/yet-another-variant-array (0ff0f40) with develop (7349cd6)

Open in CodSpeed

Footnotes

  1. 24 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@AdamGS AdamGS requested a review from mprammer May 11, 2026 16:19
@AdamGS AdamGS added the changelog/feature A new feature label May 11, 2026
@AdamGS AdamGS requested review from onursatici and robert3005 May 12, 2026 10:26
@AdamGS AdamGS changed the title WIP: Hopefully final Variant iteration Updated Variant array and the new VariantGet expression May 12, 2026
@AdamGS AdamGS marked this pull request as ready for review May 12, 2026 10:31
Comment on lines 712 to 760
Canonical::Bool(b) => {
let validity = child_to_validity(b.slots()[0].as_ref(), b.dtype().nullability());
let len = b.len();
let BoolDataParts { bits, offset, len } = b.into_data().into_parts(len);
Ok(RecursiveCanonical(Canonical::Bool(
BoolArray::try_new_from_handle(bits, offset, len, validity.execute(ctx)?)?,
)))
}
Canonical::Primitive(p) => {
let PrimitiveDataParts {
ptype,
buffer,
validity,
} = p.into_data_parts();
Ok(RecursiveCanonical(Canonical::Primitive(unsafe {
PrimitiveArray::new_unchecked_from_handle(buffer, ptype, validity.execute(ctx)?)
})))
}
Canonical::Decimal(d) => {
let DecimalDataParts {
decimal_dtype,
values,
values_type,
validity,
} = d.into_data_parts();
Ok(RecursiveCanonical(Canonical::Decimal(unsafe {
DecimalArray::new_unchecked_handle(
values,
values_type,
decimal_dtype,
validity.execute(ctx)?,
)
})))
}
Canonical::VarBinView(vbv) => {
let VarBinViewDataParts {
dtype,
buffers,
views,
validity,
} = vbv.into_data_parts();
Ok(RecursiveCanonical(Canonical::VarBinView(unsafe {
VarBinViewArray::new_handle_unchecked(
views,
buffers,
dtype,
validity.execute(ctx)?,
)
})))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

want to add a todo for using recursively_canonicalize_slots for all arrays here.

Comment on lines +136 to +147
let mut chunks = Vec::with_capacity(input.len());

for idx in 0..input.len() {
let scalar = input.execute_scalar(idx, ctx)?;
let output = variant_get_scalar(&scalar, options, &dtype)?;
chunks.push(ConstantArray::new(output, 1).into_array());
}

let array = ChunkedArray::try_new(chunks, dtype.clone())?.into_array();
if dtype.is_variant() {
VariantArray::try_new(array, None).map(|array| array.into_array())
} else {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks very slow, can you add a comment explaining this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the most general behavior we have, ideally we push it down into children that can do smarter stuff.


Ok(options
.dtype()
.map_or(DType::Variant(Nullability::Nullable), DType::as_nullable))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you store as option dtype as possibly non-nullable? There is a hidden cast

Comment thread vortex-array/src/arrays/variant/mod.rs Outdated
pub trait VariantArrayExt: TypedArrayRef<Variant> {
fn child(&self) -> &ArrayRef {
self.as_ref().slots()[0]
/// Returns the raw storage that preserves the full variant value for every row.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this does contain shredded data again

Comment on lines +49 to +67
pub fn try_new(core_storage: ArrayRef, shredded: Option<ArrayRef>) -> VortexResult<Self> {
let dtype = core_storage.dtype().clone();
vortex_ensure!(
matches!(dtype, DType::Variant(_)),
"VariantArray core_storage dtype must be Variant, found {dtype}"
);
let len = core_storage.len();
let stats = core_storage.statistics().to_owned();
Ok(Array::try_from_parts(
ArrayParts::new(Variant, dtype, len, EmptyArrayData)
.with_slots(vec![Some(core_storage), shredded].into()),
)?
.with_stats_set(stats))
}

/// Creates a new `VariantArray`.
pub fn new(child: ArrayRef) -> Self {
let dtype = DType::Variant(child.dtype().nullability());
let len = child.len();
let stats = child.statistics().to_owned();
unsafe {
Array::from_parts_unchecked(
ArrayParts::new(Variant, dtype, len, EmptyArrayData)
.with_slots(smallvec![Some(child)]),
)
pub fn new(core_storage: ArrayRef) -> Self {
Self::try_new(core_storage, None).vortex_expect("invalid VariantArray core_storage")
}
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These methods have odd names and parameters


impl Array<Variant> {
/// Creates a new `VariantArray` with raw core storage and optional shredded storage.
pub fn try_new(core_storage: ArrayRef, shredded: Option<ArrayRef>) -> VortexResult<Self> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here explain or link to what you pass here

@robert3005
Copy link
Copy Markdown
Contributor

@claude can you review this?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 12, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.

@@ -198,7 +223,12 @@ fn scalar_from_shredded_object_scalar(
if !unshredded.is_null() {
let unshredded = unshredded.as_struct();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as_struct here should not panic here as long as the underlying parquet array is not corrupt right? Any present unshredded data when there is a shredded value has to be a struct

})
.transpose()?;
let core_storage = core_storage_without_typed_value(&array)?;
Ok(ExecutionResult::done(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know there must be lots of discussion about this and I think having variant as a dtype is fine, but it is still weird for me to have a fully shredded int array and we still have to canonicalise it into a vortex variant array with empty core_storage, instead of returning a primitive array

// TODO(joe)[#7674]: iterative execution here too
Ok(ExecutionResult::done(_canonicalize(array.as_view(), ctx)?))
}
DType::Variant(..) => Ok(ExecutionResult::done(array)),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't we potentially returning a non canonical array here? the executor would try to call execute repeatedly getting back the same array then bail failing to match the array with canonical?

Copy link
Copy Markdown
Contributor

@robert3005 robert3005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think everything apart from ChunkedArray(Variant) execute are just nits

Comment thread encodings/parquet-variant/src/array.rs Outdated
Comment thread encodings/parquet-variant/src/array.rs Outdated
Comment thread vortex-array/src/arrays/variant/vtable/kernel.rs Outdated
Comment thread vortex-array/src/arrays/variant/vtable/kernel.rs Outdated
Comment thread vortex-array/src/arrays/variant/vtable/kernel.rs Outdated
};

let typed = shredded.execute_scalar(index, ctx)?;
let fallback = (typed.is_null() || typed.dtype().is_struct())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why struct as well?

chunks.push(ConstantArray::new(output, 1).into_array());
}

let array = ChunkedArray::try_new(chunks, dtype.clone())?.into_array();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very weird. How does this become canonical?

// TODO(joe)[#7674]: iterative execution here too
Ok(ExecutionResult::done(_canonicalize(array.as_view(), ctx)?))
}
DType::Variant(..) => Ok(ExecutionResult::done(array)),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems incorrect. Can you add a test for canonicalizing variant_get expression?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

Comment thread vortex-array/src/arrays/masked/execute.rs Outdated
let fallback = make_fallback(ctx)?;
let typed_mask = typed.is_not_null()?;
typed_mask
.zip(typed, fallback)?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a follow up, we are computing the entire fallback array here but only take indices where typed_mask is false, so we can narrow the core_storage to only the false indices and compute from there then zip

@AdamGS AdamGS force-pushed the adamg/yet-another-variant-array branch 2 times, most recently from e16b60d to 7546876 Compare May 14, 2026 12:11
AdamGS added 8 commits May 14, 2026 14:33
Signed-off-by: "Adam Gutglick" <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: "Adam Gutglick" <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Expose canonical VariantArray core_storage and optional shredded children, and surface Parquet typed_value as canonical shredded storage during canonicalization.

Signed-off-by: "Adam Gutglick" <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Keep canonical VariantArray core_storage and shredded children row-aligned through slice, filter, take, mask validity, and canonicalization paths.

Signed-off-by: "Adam Gutglick" <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: "Adam Gutglick" <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: "Adam Gutglick" <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
AdamGS added 17 commits May 14, 2026 14:33
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: cf8065c
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: fdc1e44
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 8ef03b2
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 9b6eeb4
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: c0764b3
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 5ff2cb8

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: be87775
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 175ed3f
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: c2d25ff
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: c92afe0
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 67895f8
I, Adam Gutglick <adam@spiraldb.com>, hereby add my Signed-off-by to this commit: 8f61afc

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/yet-another-variant-array branch from 7546876 to 0ff0f40 Compare May 14, 2026 13:33
Comment thread vortex-array/src/canonical.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants