[Variant] Fix the variant shred logic by klion26 · Pull Request #10157 · apache/arrow-rs

klion26 · 2026-06-18T05:56:45Z

Which issue does this PR close?

Closes Fix the shred logic in parquet-variant #10145.

What changes are included in this PR?

Seperate the logic for variant_shred and variant_get by introduce shred flag for row builder
Move some logic in Variant::as_xxx to type_conversion, and these moved code will be used when variant_get, variant_shred will use Variant::as_xxx
After the change, when shredding a variant, all Variant::as_xx() is identity function now, Variant::Int8 can only be treated as int8, but not int16/int32/int64, etc.
Removed the Variant::as_f16
Add a test to cover that shred can/can't be shredded to some datatype in test_variant_type_shredded_correctly

Are these changes tested?

Yes, added some tests to cover the logic

Are there any user-facing changes?

Yes, some Variant::as_xx logic have been changed.

Previously, variant_shred and variant_get reuse the same code path, causing type mismatches (e.g., bool was incorrectly shredded as int). This commit separates the shred path and ensures each type is converted correctly.

klion26

@scovich @sdf-jkl Please help review this when you're free, thanks.

klion26 · 2026-06-18T06:00:06Z

-impl_primitive_from_variant!(datatypes::Float32Type, as_f32);
-impl_primitive_from_variant!(datatypes::Float64Type, as_f64);
-impl_primitive_from_variant!(datatypes::Date32Type, as_naive_date, |v| {
+enum NumericKind {


These code are mainly moved from Variant.rs, the change here was adding the parameter shred for from_variant and variant_to_unscaled_decimal.

sdf-jkl

Hey @klion26 I took a look.

Overall I like the approach but I think we can improve it.

To follow more closely the existing split between arrow-array and arrow-cast we can only keep identity accessors for just the specific type without widening/narrowing. (i.e. Variant::as_i32 only if the Variant is i32) (it'd also address the lossy cast comment below)

And as for switching between shred_variant and variant_get cast semantics, we could pass them through variant_cast_with_options and add an argument for cast semantics we use { shred, get }

sdf-jkl · 2026-06-21T21:00:06Z

 /// [specification]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
 /// [Variant Shredding specification]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
 ///
-/// # Casting Semantics


I originally thought it's a wrong idea to decouple cast from parquet-variant crate, but now I think it's very similar to how arrow-array and arrow-cast are decoupled.

sdf-jkl · 2026-06-21T21:16:13Z

                cast_options,
                capacity,
                NullValue::ArrayElement,
+                false,


Suggested change

false,

true,

if we keep false here it preserves the old too lenient cast semantics.

The branch explicitly checks if shredded is true.

Can check with (doesn't pass):

// `VariantToListArrowRowBuilder::try_new` hardcodes `shred = false` for its element // builder (see variant_to_arrow.rs). The struct-field path threads `shred` correctly, so this // asymmetry is invisible without a nested-list test. #[test] fn test_array_shredding_does_not_cast_bool_element_to_int() { let input = build_variant_array(vec![VariantRow::List(vec![ VariantValue::from(1i64), VariantValue::from(true), VariantValue::from(2i64), ])]); let list_schema = DataType::List(Arc::new(Field::new("item", DataType::Int64, true))); let result = shred_variant(&input, &list_schema).unwrap(); assert_eq!(result.len(), 1); assert_list_structure_and_elements::<Int64Type, i32>( &result, 1, &[0, 3], &[Some(3)], &[None], ( // element 1 (the bool) must NOT be shredded into typed_value ... &[Some(1), None, Some(2)], // ... it must be preserved verbatim in the residual `value` instead. &[None, Some(Variant::BooleanTrue), None], ), ); }

As, thanks for pointing this out, this was my bad, I used false as a placeholder, but did not change me all.

sdf-jkl · 2026-06-21T21:19:35Z

                cast_options,
                capacity,
                NullValue::ArrayElement,
+                false,


Suggested change

false,

true,

same as above

sdf-jkl · 2026-06-22T00:02:31Z

-        self.as_num()
+        match *self {
+            Variant::Float(i) => Some(i),
+            Variant::Double(i) => Some(i as f32),


Should shred do lossy narrowing f64 to f32?

klion26

@sdf-jkl Thanks for the review, I've addressed the comments, please take another look when you're free.

For the identity get for Variant::as_xx, I have thought that, but keep it as now, because the underlying physical data is the same(don't do lossless conversion), but I'm open for this question.

For the switching between shred_variant and variant_get, I left the switch logic to the definition of the builder -- here we know how to do the conversion better, variant_cast_with_options seems like a wrapper of them, so I keep as it for now.

klion26 · 2026-06-25T02:16:40Z

After another look of the code, I perfer the identity function for Variant::as_xx now, as some double can't convert to float even located in [f32::MIN, f32::MAX], this makes the semantics of Variant::as_f32 tricky(users may not know which double(f64) can or can't be convert to f32 easily), -- for shred, this may led to some double (in the same array) shred to f32 into typed_value column, and some double(which can't convert into f32 lossless) located into value column.

After changing to the identity function for Variant::as_xxx, there is a problem for Variant::as_u8/16/32/64, how do we want to handle this?
if we align the semantics for all of the Variant::as_xxx, maybe we can only return Some(value) for Variant::as_u8 when it is Variant::Int8 and None for the other variants, etc

fn as_u8(&self) -> Option<u8> {
   match *self {
     Variant::Int8(i) => i.try_into().ok(),
     _ => None,
   }
}

...

fn as_u64(&self) -> Option<u64> {
    match *self {
       Variant::Int64(i) => i.try_into().ok(),
      _ => None,
}

sdf-jkl · 2026-06-25T03:59:07Z

If we go identity functions for Variant::as_xx we can drop Variant::as_u8/16/32/64 from parquet-variant entirely and keep them only as cast semantics in parquet-variant-compute?

We don't shred to unsigned types and the only path to use it would be cast in variant_get, so parquet-variant can safely drop them.

klion26 · 2026-06-25T11:51:16Z

I've update the code,

changed the Variant::as_xx as identity function
removed Variant::as_f16, -- will always return None for as_f16 if we keep it.

but don't remove the Variant::as_u8/16/32/64 for now, we use Variant::as_xx in the shred logic, but Variant::as_xx is not for shred use only, so I don't remove them for now, Variant::as_u8 returns Some(v) for Variant::Int8 but None for other variants. I can remove them if reached to a consensuse that we want to remove Variant::as_u8/16/32/64

klion26 · 2026-06-25T11:53:03Z

+        }
+    }
+);
+impl_primitive_from_variant!(


Align the cast logic with arrow-cast for Time related variant will in a followup pr

sdf-jkl

Thanks @klion26.

I'm still not on board with keeping as_u* APIs because they are not identity cast (since there's no unsigned Variant and we can't shred to unsigned) but let's leave it to @alamb / @scovich.

Some doc fixes and a behavioral bug for decimal handling.

sdf-jkl · 2026-06-26T14:13:33Z

-    ///  use parquet_variant::{Variant, VariantDecimal4};
+    ///  use parquet_variant::Variant;
    ///
    ///  // you can read an int64 variant into an u8


Suggested change

/// // you can read an int64 variant into an u8

/// // you can read an int8 variant into an u8

sdf-jkl · 2026-06-26T14:14:28Z

-    ///  use parquet_variant::{Variant, VariantDecimal4};
+    ///  use parquet_variant::Variant;
    ///
    ///  // you can read an int64 variant into an u16


same as above:

Suggested change

/// // you can read an int64 variant into an u16

/// // you can read an int16 variant into an u16

sdf-jkl · 2026-06-26T14:14:45Z

@@ -1111,40 +911,28 @@ impl<'m, 'v> Variant<'m, 'v> {
    ///  use parquet_variant::{Variant, VariantDecimal8};
    ///
    ///  // you can read an int64 variant into an u32


Suggested change

/// // you can read an int64 variant into an u32

/// // you can read an int32 variant into an u32

sdf-jkl · 2026-06-26T14:28:05Z

+    /// Returns `Some(u8)` for int8 variants that fit in `u8`,
    /// `None` for other variants or values that would overflow.


Suggested change

/// Returns `Some(u8)` for int8 variants that fit in `u8`,

/// `None` for other variants or values that would overflow.

/// Returns `Some(u8)` for a non-negative int8 variant, `None` otherwise.

Same case for as_u* below

sdf-jkl · 2026-06-26T15:22:08Z

+///
+/// This function is unlike `variant_to_unscaled_decim`, it would never rescale the decimal value,
+/// and only return the unscaled integer representation for the specific decimal variants.
+pub(crate) fn shred_variant_to_unscaled_decimal<O>(variant: &Variant<'_, '_>) -> Option<O::Native>


The function should check that Variant's scale matches the column scale. Otherwise Decimal(123, scale 3) shredded into a Decimal(.., scale 5) column would become .00123 instead of .123.

The question to rescale or not is also debatable.

I digged into spark and currently they rescale decimals and even cast to integers and vice versa:

// If true, we will shred decimals to a different scale or to integers, as long as they are // numerically equivalent. Similarly, integers will be allowed to shred to decimals.

More resources:

The rescale toggle they current set to True by default -
https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkShreddingUtils.scala#L699-L700

sparks decimal rescale logic - https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/common/variant/src/main/java/org/apache/spark/types/variant/VariantShreddingWriter.java#L199-L216

Thanks for pointing this out, I'll take a look at it soon

scovich · 2026-06-29T13:33:18Z

I'm still not on board with keeping as_u* APIs because they are not identity cast (since there's no unsigned Variant and we can't shred to unsigned) but let's leave it to @alamb / @scovich.

From what I understand, the as_xxx methods serve two purposes:

Shredding - as_uxx is useless
Convenience when e.g. picking apart the fields of a specific type that the code expects but where the underlying data may not match exactly. Here, as_uxx are very helpful, if that's what the "parsing" code expected.

@sdf-jkl it seems to me that the shredding code should simply not use the as_uxx methods, give that it's illegal to shred as unsigned types? Or is there some generic code that wrongly pulls them in automatically along with all the rest?

Orthogonally, there's the question of whether to cast at all, and if so how aggressively. This is something we debated at length in the early days, and I had called out that there are four possible levels of aggressiveness:

Type-based lossless (e.g. i8 -> i64 always succeeds)
Value-based lossless (e.g. i64 -> i8 may succeed, for values in -128..128)
Lossy (e.g. f64 -> i64 will truncate/round all trailing decimal digits)
Converting (e.g. string -> i64 parses a string, i64 -> string)

I had originally argued for 1/ only, because it is the least surprising. But 2/ is useful for shredding and many people anyway expect it because their system does implicit casting. 3/ is where things start to get questionable (information loss happens too easily IMO). An explicit cast operation can reasonably do both 3/ and 4/, because the user explicitly opted in by requesting the cast.

So we'll need to decide which level of aggressiveness variant does internally, vs. what requires a cast.

scovich · 2026-06-29T14:24:38Z

Ah, after re-reading I realize there's a fifth level of aggressiveness (0/ - no type changes whatsoever), and that's the proposed new behavior of as_xxx methods? In that case it does make sense to eliminate as_uxx methods because as you say those don't exist in the underlying variant and so would always return None. They only make sense if we choose 1/ or 2/.

One reason for choosing 1/ or 2/ is that it increases the effectiveness of shredding. For example, when shredding as i32 it's annoying for small i64 values to not shred when they fit in i32, and even more annoying for i8 and i16 values to not shred (they always fit in i32). Especially when you observe that variant is really just JSON, and JSON just has a generic numeric type. It's completely implementation-defined which of the "exact numeric" equivalence class types the JSON literal 42 parses to (int8, int16, int32, int64, decimal4, decimal8, or decimal16). And the variant spec state that:

The Equivalence Class column [of the Variant primitive types table] indicates logical equivalence of physically encoded types. For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding. Similarly, user expressions operating on an int8 value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.

Furthermore, the JSON spec encourages implementations to treat numeric types as IEEE 754 double precision values, so we have to consider double as well:

This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754-2008 binary64 (double precision) numbers is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision.
...
Note that when such software is used, numbers that are integers and are in the range [-(2**53)+1, (2**53)-1] are interoperable in the sense that implementations will agree exactly on their numeric values.

(I'm not sure why/when we'd ever see a float value, but we may as well handle it if we anyway have to handle double)

All in all, the above suggests that as_xxx methods should take approach 2/ in order to maximize conformance with both variant and JSON specs while also minimizing surprise to users. But there's plenty of room for varying opinions, e.g.:

What, exactly, counts as a "user expression" in the variant spec
Whether we really care about the JSON spec's suggested 53 bits of precision, when variant specifically includes a 256-bit decimal type

alamb · 2026-06-29T14:26:22Z

I haven't read this in detail, but in general as long as there is some reasonable way of getting the same behavior as the current as_xxx methods, I think it would be fine to deprecate them

I think we wrote them before the shredding logic existed and just assumed they would be useful, but if they are more confusing than useful then removing them sounds good to me

sdf-jkl · 2026-06-30T16:56:51Z

I think shred_variant doesn't have to use parquet-variant - as_* identity APIs.

shred_variant should have it's own match arms for cast semantics for specific types.

Identity APIs for simple types and special casts from parquet-variant-compute to handle cases where multiple physical types share an Equivalence Class

So, we can do 1/2 for actual shred semantics and 0 for parquet-variant crate.

klion26 · 2026-07-01T12:15:19Z

Sorry for the late reply, a little busy these days on internal work.

Value-based lossless (e.g. i64 -> i8 may succeed, for values in -128..128)

@scovich For the second type, there is some case may be trick: for float64 -> float32, not all float64 in [float32::MIN, float32::MAX] can convert to float32 loselesse, this makes the semantics tricky, such as 3f64 and 4f64 can convert to f32 losslesse, but f64::PI can't convert to f32 lossless, this makes the user a little difficult(not as ease as i64 -> i8) to expect which f64 will convert to f32, this is the reason why I perfer to identify API for as_xxx API

If the value-based lossless conversion for f64 -> f32 is receivable for us, then we can implement the Variant::as_xxx using the second approach(value-based lossless conversion).

scovich · 2026-07-01T13:05:25Z

Sorry for the late reply, a little busy these days on internal work.

Value-based lossless (e.g. i64 -> i8 may succeed, for values in -128..128)

@scovich For the second type, there is some case may be trick: for float64 -> float32, not all float64 in [float32::MIN, float32::MAX] can convert to float32 loselesse, this makes the semantics tricky, such as 3f64 and 4f64 can convert to f32 losslesse, but f64::PI can't convert to f32 lossless, this makes the user a little difficult(not as ease as i64 -> i8) to expect which f64 will convert to f32, this is the reason why I perfer to identify API for as_xxx API

If the value-based lossless conversion for f64 -> f32 is receivable for us, then we can implement the Variant::as_xxx using the second approach(value-based lossless conversion).

I'm not worried about supporting f64 -> f32 conversions. Almost all values will suffer a lossy conversion (as you point out). Also, I don't know of (m)any JSON parsers that choose f32 when they see a non-integral number. They all choose f64.

I do worry about not supporting i8 -> i64 conversions because many parsers -- including our very own arrow-rs json-to-variant parser -- choose the narrowest type that can represent a given number. So a column with values [1, 1000, 100_000, 10_000_000_000, 100_000_000_000_000_000_000] would variant-parse as [i8, i16, i32, i64, f64] (we don't support parsing decimal values from json yet). And the problem can't be fixed by e.g. cast to i64, because that would truncate (or fail) the f64 value that doesn't fit. The whole point of variant is to accommodate "wrong" type values, even when shredding.

Similarly, I worry a little (tho less) about supporting i64 -> i8 conversions because somebody may want to shred a column as a narrow type, knowing that most values are small and that any outliers will stay behind in the value column. Again, cast to i8 wouldn't work because it would truncate/fail the outliers.

scovich · 2026-07-01T13:10:24Z

I think shred_variant doesn't have to use parquet-variant - as_* identity APIs.

shred_variant should have it's own match arms for cast semantics for specific types.

Fair point. But (a) we'd still need the lossless casting logic somewhere, if we choose anything other than identity semantics; and (b) the general cast kernel isn't a good fit for variant's semantics where outliers and wrong-type values are not just expected but embraced.

I guess the question might be: How much support should arrow-rs give to variant users who care about losslessly preserving heterogeneous values, outside of shredding?

github-actions Bot added the parquet-variant parquet-variant* crates label Jun 18, 2026

klion26 force-pushed the fix_variant_shred branch from 32a3f09 to aca457f Compare June 18, 2026 06:39

[Variant] Fix the variant shred logic

003214c

Previously, variant_shred and variant_get reuse the same code path, causing type mismatches (e.g., bool was incorrectly shredded as int). This commit separates the shred path and ensures each type is converted correctly.

klion26 force-pushed the fix_variant_shred branch from aca457f to 003214c Compare June 18, 2026 06:55

klion26 commented Jun 18, 2026

View reviewed changes

sdf-jkl reviewed Jun 22, 2026

View reviewed changes

klion26 commented Jun 24, 2026

View reviewed changes

address comments

55c4a7d

klion26 force-pushed the fix_variant_shred branch from 5032777 to 55c4a7d Compare June 24, 2026 09:47

klion26 force-pushed the fix_variant_shred branch from a2c254c to 03060f9 Compare June 25, 2026 11:36

change all Variant::as_xx to identity function

d370f9b

klion26 force-pushed the fix_variant_shred branch from 03060f9 to d370f9b Compare June 25, 2026 11:45

klion26 commented Jun 25, 2026

View reviewed changes

sdf-jkl reviewed Jun 26, 2026

View reviewed changes

	/// // you can read an int64 variant into an u8
	/// // you can read an int8 variant into an u8

	/// // you can read an int64 variant into an u16
	/// // you can read an int16 variant into an u16

	/// // you can read an int64 variant into an u32
	/// // you can read an int32 variant into an u32

		/// Returns `Some(u8)` for int8 variants that fit in `u8`,
		/// `None` for other variants or values that would overflow.

	/// Returns `Some(u8)` for int8 variants that fit in `u8`,
	/// `None` for other variants or values that would overflow.
	/// Returns `Some(u8)` for a non-negative int8 variant, `None` otherwise.

Uh oh!

Conversation

klion26 commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

klion26 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sdf-jkl left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

klion26 left a comment

Choose a reason for hiding this comment

Uh oh!

klion26 commented Jun 25, 2026

Uh oh!

sdf-jkl commented Jun 25, 2026

Uh oh!

klion26 commented Jun 25, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sdf-jkl left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scovich commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scovich commented Jun 29, 2026

Uh oh!

alamb commented Jun 29, 2026

Uh oh!

sdf-jkl commented Jun 30, 2026

Uh oh!

klion26 commented Jul 1, 2026

Uh oh!

scovich commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

scovich commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

klion26 commented Jun 18, 2026 •

edited

Loading

scovich commented Jun 29, 2026 •

edited

Loading

scovich commented Jul 1, 2026 •

edited

Loading

scovich commented Jul 1, 2026 •

edited

Loading