[Variant] Fix the variant shred logic#10157
Conversation
32a3f09 to
aca457f
Compare
Previously, variant_shred and variant_get reuse the same code path, causing type mismatches (e.g., bool was incorrectly shredded as int). This commit separates the shred path and ensures each type is converted correctly.
aca457f to
003214c
Compare
| impl_primitive_from_variant!(datatypes::Float32Type, as_f32); | ||
| impl_primitive_from_variant!(datatypes::Float64Type, as_f64); | ||
| impl_primitive_from_variant!(datatypes::Date32Type, as_naive_date, |v| { | ||
| enum NumericKind { |
There was a problem hiding this comment.
These code are mainly moved from Variant.rs, the change here was adding the parameter shred for from_variant and variant_to_unscaled_decimal.
sdf-jkl
left a comment
There was a problem hiding this comment.
Hey @klion26 I took a look.
Overall I like the approach but I think we can improve it.
To follow more closely the existing split between arrow-array and arrow-cast we can only keep identity accessors for just the specific type without widening/narrowing. (i.e. Variant::as_i32 only if the Variant is i32) (it'd also address the lossy cast comment below)
And as for switching between shred_variant and variant_get cast semantics, we could pass them through variant_cast_with_options and add an argument for cast semantics we use { shred, get }
| /// [specification]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md | ||
| /// [Variant Shredding specification]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md | ||
| /// | ||
| /// # Casting Semantics |
There was a problem hiding this comment.
I originally thought it's a wrong idea to decouple cast from parquet-variant crate, but now I think it's very similar to how arrow-array and arrow-cast are decoupled.
| cast_options, | ||
| capacity, | ||
| NullValue::ArrayElement, | ||
| false, |
There was a problem hiding this comment.
| false, | |
| true, |
if we keep false here it preserves the old too lenient cast semantics.
The branch explicitly checks if shredded is true.
Can check with (doesn't pass):
// `VariantToListArrowRowBuilder::try_new` hardcodes `shred = false` for its element
// builder (see variant_to_arrow.rs). The struct-field path threads `shred` correctly, so this
// asymmetry is invisible without a nested-list test.
#[test]
fn test_array_shredding_does_not_cast_bool_element_to_int() {
let input = build_variant_array(vec![VariantRow::List(vec![
VariantValue::from(1i64),
VariantValue::from(true),
VariantValue::from(2i64),
])]);
let list_schema = DataType::List(Arc::new(Field::new("item", DataType::Int64, true)));
let result = shred_variant(&input, &list_schema).unwrap();
assert_eq!(result.len(), 1);
assert_list_structure_and_elements::<Int64Type, i32>(
&result,
1,
&[0, 3],
&[Some(3)],
&[None],
(
// element 1 (the bool) must NOT be shredded into typed_value ...
&[Some(1), None, Some(2)],
// ... it must be preserved verbatim in the residual `value` instead.
&[None, Some(Variant::BooleanTrue), None],
),
);
}There was a problem hiding this comment.
As, thanks for pointing this out, this was my bad, I used false as a placeholder, but did not change me all.
| cast_options, | ||
| capacity, | ||
| NullValue::ArrayElement, | ||
| false, |
There was a problem hiding this comment.
| false, | |
| true, |
same as above
| self.as_num() | ||
| match *self { | ||
| Variant::Float(i) => Some(i), | ||
| Variant::Double(i) => Some(i as f32), |
There was a problem hiding this comment.
Should shred do lossy narrowing f64 to f32?
klion26
left a comment
There was a problem hiding this comment.
@sdf-jkl Thanks for the review, I've addressed the comments, please take another look when you're free.
For the identity get for Variant::as_xx, I have thought that, but keep it as now, because the underlying physical data is the same(don't do lossless conversion), but I'm open for this question.
For the switching between shred_variant and variant_get, I left the switch logic to the definition of the builder -- here we know how to do the conversion better, variant_cast_with_options seems like a wrapper of them, so I keep as it for now.
5032777 to
55c4a7d
Compare
|
After another look of the code, I perfer the identity function for After changing to the identity function for |
|
If we go identity functions for We don't shred to unsigned types and the only path to use it would be cast in |
a2c254c to
03060f9
Compare
03060f9 to
d370f9b
Compare
|
I've update the code,
but don't remove the |
| } | ||
| } | ||
| ); | ||
| impl_primitive_from_variant!( |
There was a problem hiding this comment.
Align the cast logic with arrow-cast for Time related variant will in a followup pr
| /// use parquet_variant::{Variant, VariantDecimal4}; | ||
| /// use parquet_variant::Variant; | ||
| /// | ||
| /// // you can read an int64 variant into an u8 |
There was a problem hiding this comment.
| /// // you can read an int64 variant into an u8 | |
| /// // you can read an int8 variant into an u8 |
| /// use parquet_variant::{Variant, VariantDecimal4}; | ||
| /// use parquet_variant::Variant; | ||
| /// | ||
| /// // you can read an int64 variant into an u16 |
There was a problem hiding this comment.
same as above:
| /// // you can read an int64 variant into an u16 | |
| /// // you can read an int16 variant into an u16 |
| @@ -1111,40 +911,28 @@ impl<'m, 'v> Variant<'m, 'v> { | |||
| /// use parquet_variant::{Variant, VariantDecimal8}; | |||
| /// | |||
| /// // you can read an int64 variant into an u32 | |||
There was a problem hiding this comment.
| /// // you can read an int64 variant into an u32 | |
| /// // you can read an int32 variant into an u32 |
| /// Returns `Some(u8)` for int8 variants that fit in `u8`, | ||
| /// `None` for other variants or values that would overflow. |
There was a problem hiding this comment.
| /// Returns `Some(u8)` for int8 variants that fit in `u8`, | |
| /// `None` for other variants or values that would overflow. | |
| /// Returns `Some(u8)` for a non-negative int8 variant, `None` otherwise. |
Same case for as_u* below
| /// | ||
| /// This function is unlike `variant_to_unscaled_decim`, it would never rescale the decimal value, | ||
| /// and only return the unscaled integer representation for the specific decimal variants. | ||
| pub(crate) fn shred_variant_to_unscaled_decimal<O>(variant: &Variant<'_, '_>) -> Option<O::Native> |
There was a problem hiding this comment.
The function should check that Variant's scale matches the column scale. Otherwise Decimal(123, scale 3) shredded into a Decimal(.., scale 5) column would become .00123 instead of .123.
The question to rescale or not is also debatable.
I digged into spark and currently they rescale decimals and even cast to integers and vice versa:
// If true, we will shred decimals to a different scale or to integers, as long as they are
// numerically equivalent. Similarly, integers will be allowed to shred to decimals.
More resources:
- The rescale toggle they current set to
Trueby default -
https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/SparkShreddingUtils.scala#L699-L700 - sparks decimal rescale logic - https://github.com/apache/spark/blob/c8695495569e4058a758b111ebc942fc4906494c/common/variant/src/main/java/org/apache/spark/types/variant/VariantShreddingWriter.java#L199-L216
There was a problem hiding this comment.
Thanks for pointing this out, I'll take a look at it soon
From what I understand, the
@sdf-jkl it seems to me that the shredding code should simply not use the Orthogonally, there's the question of whether to cast at all, and if so how aggressively. This is something we debated at length in the early days, and I had called out that there are four possible levels of aggressiveness:
I had originally argued for 1/ only, because it is the least surprising. But 2/ is useful for shredding and many people anyway expect it because their system does implicit casting. 3/ is where things start to get questionable (information loss happens too easily IMO). An explicit So we'll need to decide which level of aggressiveness variant does internally, vs. what requires a |
|
Ah, after re-reading I realize there's a fifth level of aggressiveness (0/ - no type changes whatsoever), and that's the proposed new behavior of One reason for choosing 1/ or 2/ is that it increases the effectiveness of shredding. For example, when shredding as i32 it's annoying for small i64 values to not shred when they fit in i32, and even more annoying for i8 and i16 values to not shred (they always fit in i32). Especially when you observe that variant is really just JSON, and JSON just has a generic numeric type. It's completely implementation-defined which of the "exact numeric" equivalence class types the JSON literal
Furthermore, the JSON spec encourages implementations to treat numeric types as IEEE 754 double precision values, so we have to consider
(I'm not sure why/when we'd ever see a All in all, the above suggests that
|
|
I haven't read this in detail, but in general as long as there is some reasonable way of getting the same behavior as the current I think we wrote them before the shredding logic existed and just assumed they would be useful, but if they are more confusing than useful then removing them sounds good to me |
|
I think
Identity APIs for simple types and special casts from So, we can do 1/2 for actual shred semantics and 0 for |
|
Sorry for the late reply, a little busy these days on internal work.
@scovich For the second type, there is some case may be trick: for float64 -> float32, not all float64 in If the value-based lossless conversion for |
I'm not worried about supporting f64 -> f32 conversions. Almost all values will suffer a lossy conversion (as you point out). Also, I don't know of (m)any JSON parsers that choose f32 when they see a non-integral number. They all choose f64. I do worry about not supporting i8 -> i64 conversions because many parsers -- including our very own arrow-rs json-to-variant parser -- choose the narrowest type that can represent a given number. So a column with values Similarly, I worry a little (tho less) about supporting i64 -> i8 conversions because somebody may want to shred a column as a narrow type, knowing that most values are small and that any outliers will stay behind in the |
Fair point. But (a) we'd still need the lossless casting logic somewhere, if we choose anything other than identity semantics; and (b) the general I guess the question might be: How much support should arrow-rs give to variant users who care about losslessly preserving heterogeneous values, outside of shredding? |
Which issue does this PR close?
What changes are included in this PR?
variant_shredandvariant_getby introduceshredflag for row builderVariant::as_xxxtotype_conversion, and these moved code will be used whenvariant_get,variant_shredwill useVariant::as_xxxVariant::as_xx()is identity function now,Variant::Int8can only be treated asint8, but notint16/int32/int64, etc.Variant::as_f16test_variant_type_shredded_correctlyAre these changes tested?
Yes, added some tests to cover the logic
Are there any user-facing changes?
Yes, some
Variant::as_xxlogic have been changed.