Spark: Support writing shredded variant in Iceberg-Spark#14297
Spark: Support writing shredded variant in Iceberg-Spark#14297aihuaxu wants to merge 32 commits intoapache:mainfrom
Conversation
16b7a09 to
dc4f72e
Compare
97851f0 to
b87e999
Compare
|
@amogh-jahagirdar @Fokko @huaxingao Can you help take a look at this PR and if we have better approach for this? |
|
cc @RussellSpitzer, @pvary and @rdblue Seems it's better to have the implementation with new File Format proposal but want to check if this is acceptable approach as an interim solution or you see a better alternative. |
|
@aihuaxu: Don't we want to do the same but instead of wrapping the Would this be prohibitively complex? |
|
In Spark DSv2, planning/validation happens on the driver. For shredded variant, we don’t know the shredded schema at planning time. We have to inspect some records to derive it. Doing a read on the driver during Because of that, the current proposed Spark approach is: put the logical variant in the writer factory, on the executor, buffer the first N rows, infer the shredded schema from data, then initialize the concrete writer and flush the buffer. I believe this PR follow the same approach, which seems like a practical solution to me given DSV2's constraints. |
|
Thanks for the explanation, @huaxingao! I see several possible workarounds for the DataWriterFactory serialization issue, but I have some more fundamental concerns about the overall approach. Even if we accept that the written data should dictate the shredding logic, Spark’s implementation—while dependent on input order—is at least somewhat stable. It drops rarely used fields, handles inconsistent types, and limits the number of columns. |
|
Thanks @huaxingao and @pvary for reviewing, and thanks to Huaxin for explaining how the writer works in Spark. Regarding the concern about unstable schemas, Spark's approach makes sense:
We could implement similar heuristics. Additionally, making the shredded schema configurable would allow users to choose which fields to shred at write time based on their read patterns. For this POC, I'd like any feedback on whether there are any significant high-level design options to consider first and if this approach is acceptable. This seems hacky. I may have missed big picture on how the writers work across Spark + Iceberg + Parquet and we may have better way. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This PR caught my eye, as I've implemented the equivalent in DuckDB: duckdb/duckdb#19336 The PR description doesn't give much away, but I think the approach is similar to the proposed (interim) solution here: buffer the first rowgroup, infer the shredded schema from this, then finalize the file schema and start writing data. We've opted to create a We've also added a copy option to force the shredded schema, for debugging purposes and for power users. As for DECIMAL, it's kind of a special case in the shredding inference. We only shred on a DECIMAL type if all the decimal values we've seen for a column/field have the same width+scale, if any decimal value differs, DECIMAL won't be considered anymore when determining the shredded type of the column/field |
|
This PR is super exciting! Regarding the heuristics - I'd like to propose adding table properties as hints for variant shredding. |
That is correct.
I'm still trying to improve the heuristics to use the most common one as shredding type rather than the first one and probably cap the number of shredded fields, etc. but it doesn't need 100% consistent type to be shredded.
Yeah. I think that makes sense for advanced user to determine the shredded schema since they may know the read pattern.
Why is DECIMAL special here? If we determine DECIMAL4 to be shredded type, then we may shred as DECIMAL4 or not shred if they cannot fit in DECIMAL4, right? |
Yeah. I'm also thinking of that too. Will address that separately. Basically based on read pattern, the user can specify the shredding schema. |
gkpanda4
left a comment
There was a problem hiding this comment.
When processing JSON objects containing null field values (e.g., {"field": null}), the variant shredding creates schema columns for these null fields instead of omitting them entirely. This would cause schema bloat.
Adding a null check in ParquetVariantUtil.java:386 in the object() method should fix it.
2e81d79 to
7e1b608
Compare
I addressed this null value check in VariantShreddingAnalyzer.java instead. If it's NULL, then we will not add the shredded field. |
7c805f6 to
67dbe97
Compare
Co-authored-by: Neelesh Salian <n_salian@apple.com> Co-authored-by: Aihua Xu <aihuaxu@gmail.com>
|
I’m fine with the core and Parquet changes. For the actual VariantShredding and Spark parts, I’d prefer to defer to others who have more experience with that area of the code. |
| * <p>Determinism contract: for a given set of variant values (regardless of row arrival order), | ||
| * this analyzer produces the same shredded schema. |
There was a problem hiding this comment.
Nit: the determinism of the schredd schema does not hold when field caps/pruning apply (>MAX_INTERMEDIATE_FIELDS), right? Tight the javadoc
| entry -> { | ||
| FieldInfo info = entry.getValue().info; | ||
| return info != null | ||
| && ((double) info.observationCount / totalRows) < MIN_FIELD_FREQUENCY; |
There was a problem hiding this comment.
Nit: under lists, observationCount is per element visit, not per row. With pruning using the root row count as the denominator, a few long lists can be enough to keep a nested field from being pruned even if the list key appears in few rows. Can you document this explicitly?
There was a problem hiding this comment.
Can you also add new test to validate the pruning behavior: Long array in a few rows survives the pruning?
There was a problem hiding this comment.
Will add a test and update the documentation.
…WRITER_VERSION_KEY)
…oc string, decimal fallback tests
|
@pvary @aihuaxu @huaxingao PTAL. Latest additions: I added a DecimalPrimitiveWriter to handle decimal scale and precision values at write time to make sure it works well. Added a couple of tests too. |
The change looks good to me. |
|
+1 from my side. |
steveloughran
left a comment
There was a problem hiding this comment.
Commenting as now I'm implementing rowgroup skipping there are more details that I can appreciate.
Shredding can explode schemas with impact at query time.
Priority should go to field types whose min/max stats permit rowgroup skipping on equality queries, and equals/less equals
That's assuming queries where its things like where deviceid=afb334, eventtime > 1899 or failures < 10.
UUIDs and arbitrary BINARY seem of limited value there as the stats are of little or no use.
| .as(LogicalTypeAnnotation.timestampType(false, LogicalTypeAnnotation.TimeUnit.NANOS)) | ||
| .named(TYPED_VALUE); | ||
| case DECIMAL4, DECIMAL8, DECIMAL16 -> createDecimalTypedValue(info); | ||
| case UUID -> |
There was a problem hiding this comment.
really curious about the benefits of shredding this. Any use case where the UUID is unique on every row has no compression, and the min/max stats are useless too.
| PhysicalType.DECIMAL8, 1, | ||
| PhysicalType.DECIMAL16, 2); | ||
|
|
||
| private static final Map<PhysicalType, Integer> TIE_BREAK_PRIORITY = |
There was a problem hiding this comment.
explain role in javadocs. And if it is "what to prefer to shred", I'd put STRING above BINARY as string compares char sequences;
binary just looks at bytes and then length.
IF I was being ruthless I'd put binary below nano timestamps.
It's just that I'd expect a lot more strings to go in than binaries, and in future those nano timestamps.
|
|
||
| @Override | ||
| public ModelWriteBuilder<D, S> set(String property, String value) { | ||
| if (SHRED_VARIANTS_KEY.equals(property)) { |
There was a problem hiding this comment.
what about a switch statement here?
| Map<Integer, Type> shreddedTypes = | ||
| variantAnalyzer.analyzeVariantColumns(bufferedRows, schema, engineSchema); | ||
|
|
||
| if (!shreddedTypes.isEmpty()) { |
There was a problem hiding this comment.
I think it would be prescient to add some logging at debug level here.
| | write.parquet.dict-size-bytes | 2097152 (2 MB) | Parquet dictionary page size | | ||
| | write.parquet.compression-codec | zstd | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed | | ||
| | write.parquet.compression-level | null | Parquet compression level | | ||
| | write.parquet.variant.shred | false | When true, variant columns are written with shredded Parquet encoding for improved query performance | |
There was a problem hiding this comment.
"for better compression and possibly also improved query performance".
Compression holds, but query perf needs
- ongoing work through the stack
- queries actually applied to the shredded values
| private static final String VALUE = "value"; | ||
| private static final String ELEMENT = "element"; | ||
| private static final double MIN_FIELD_FREQUENCY = 0.10; | ||
| private static final int MAX_SHREDDED_FIELDS = 300; |
There was a problem hiding this comment.
This is going to explode schema complexity; I can see on my query work that this has an impact on Rowgroup skipping as there's more overhead on parsing this ... I'm trying hard to optimise it w/ lazy compute, but it's still expensive.
any record with, 2+ variant objects could now have 600+ child elements.
There was a problem hiding this comment.
Maybe we can make those shredding params configurable in the future, or after more performance testing?
|
I looked at VariantShreddingAnalyzer and SparkVariantShreddingAnalyzer, implementation looks good, just minor nit. The current strategy is to shred aggressively, including fields with multiple incompatible types by picking the most common one. When a field has mixed types, the shredded typed_value is only populated for rows whose value matches the chosen type; other rows still carry the full binary value. This means bounded column reads are not available for mixed-type fields, and the performance gain relative to the added column overhead is not clear. I am not suggesting the current design is flawed. Shredding parameters like MIN_FIELD_FREQUENCY and MAX_SHREDDED_FIELDS can be tuned or new strategies introduced in follow-ups without breaking existing files. But more performance testing on real query patterns would help inform whether these thresholds need to be user-tunable. I would not block merge on this, assuming the community agrees. |
| private static final String VALUE = "value"; | ||
| private static final String ELEMENT = "element"; | ||
| private static final double MIN_FIELD_FREQUENCY = 0.10; | ||
| private static final int MAX_SHREDDED_FIELDS = 300; |
There was a problem hiding this comment.
Maybe we can make those shredding params configurable in the future, or after more performance testing?
|
|
||
| private static class PathNode { | ||
| private final String fieldName; | ||
| private final Map<String, PathNode> objectChildren = Maps.newTreeMap(); |
There was a problem hiding this comment.
Performance nit: this map is on the hot path, tree map requires string comparison. Maybe change it to hashmap since ordering is not important until after pruning createObjectTypedValue? We do sort there:
// createObjectTypedValue: sort once here
private static Type createObjectTypedValue(PathNode node) {
List<PathNode> sorted = Lists.newArrayList(node.objectChildren.values());
sorted.sort(Comparator.comparing(child -> child.fieldName));
...
}
|
Thanks for the reviews @steveloughran @qlong - all great points. I'd like to land this PR as-is and I can follow up with a PR to address these since the PR is already large. I summarized here:
None of these affect correctness. Happy to open the follow-up immediately after merge if there is agreement. |
qlong
left a comment
There was a problem hiding this comment.
I focused on shredding analyzer and it looks good to me
| public void close() throws IOException { | ||
| if (!closed) { | ||
| if (delegate == null && buffer != null && !buffer.isEmpty()) { | ||
| initialize(); |
There was a problem hiding this comment.
If initialize() or delegate.close() throws, closed and buffer never get updated. Should we wrap in try/finally so closed and buffer always get cleaned?
There was a problem hiding this comment.
Good call out, will add this.
| "write.delete.parquet.compression-level"; | ||
| public static final String PARQUET_COMPRESSION_LEVEL_DEFAULT = null; | ||
|
|
||
| public static final String PARQUET_VARIANT_SHRED = "write.parquet.variant.shred"; |
There was a problem hiding this comment.
nit: other properties in this file use write.parquet.<name> with a single hyphenated terminal (e.g. compression-codec, dict-size-bytes), maybe write.parquet.shred-variants?
| public static final String PARQUET_VARIANT_SHRED = "write.parquet.variant.shred"; | ||
| public static final boolean PARQUET_VARIANT_SHRED_DEFAULT = false; | ||
| public static final String PARQUET_VARIANT_BUFFER_SIZE = | ||
| "write.parquet.variant.inference.buffer-size"; |
There was a problem hiding this comment.
nit: write.parquet.variant-inference-buffer-size?
| // Controls the buffer size for variant schema inference during writes | ||
| // This determines how many rows are buffered before inferring shredded schema | ||
| public static final String VARIANT_INFERENCE_BUFFER_SIZE = | ||
| "spark.sql.iceberg.variant.inference.buffer-size"; |
There was a problem hiding this comment.
nit: spark.sql.iceberg.variant-inference-buffer-size?
| extends BaseFormatModel<D, S, ParquetValueWriter<?>, R, MessageType> { | ||
| public static final String SHRED_VARIANTS_KEY = TableProperties.PARQUET_VARIANT_SHRED; | ||
| public static final String VARIANT_BUFFER_SIZE_KEY = TableProperties.PARQUET_VARIANT_BUFFER_SIZE; | ||
| public static final int DEFAULT_BUFFER_SIZE = TableProperties.PARQUET_VARIANT_BUFFER_SIZE_DEFAULT; |
There was a problem hiding this comment.
nit: Do we need to add these constants? Shall we use TableProperties.PARQUET_VARIANT_SHRED / PARQUET_VARIANT_BUFFER_SIZE / PARQUET_VARIANT_BUFFER_SIZE_DEFAULT directly? Otherwise we now have two names for the same thing on the public API surface.
| // scale=min(20, 38-30)=8 (integer digits get priority) | ||
| VariantMetadata meta = Variants.metadata("val"); | ||
| ShreddedObject row1 = Variants.object(meta); | ||
| row1.put("val", Variants.of(new java.math.BigDecimal("123456789012345678901234567890"))); |
There was a problem hiding this comment.
nit: import java.math.BigDecimal
| "SELECT id, variant_get(address, '$.val', 'decimal(10,4)') FROM %s ORDER BY id", | ||
| tableName); | ||
| assertThat(rows).hasSize(6); | ||
| assertThat(rows.get(0)[1]).isEqualTo(new java.math.BigDecimal("123.4500")); |
There was a problem hiding this comment.
nit: import java.math.BigDecimal?
| String path = task.file().location(); | ||
|
|
||
| HadoopInputFile inputFile = | ||
| HadoopInputFile.fromPath(new org.apache.hadoop.fs.Path(path), new Configuration()); |
There was a problem hiding this comment.
nit: import org.apache.hadoop.fs.Path?
|
|
||
| public int variantInferenceBufferSize() { | ||
| return confParser | ||
| .intConf() |
There was a problem hiding this comment.
should we add .option(...) too? Right now shredVariants() can be overridden per-write but variantInferenceBufferSize() can't.
|
Will address @huaxingao's comments in an upcoming commit. I also realized that this PR was originally only on Spark 4.1. I'll can add the changes to Spark 4.0 too. Or should I do that in a follow up PR after this is merged?
|
What it does
This PR adds support for writing shredded variants from Spark into Iceberg tables. Variant shredding extracts commonly-typed fields from semi-structured VARIANT columns into dedicated typed Parquet columns (typed_value), enabling predicate pushdown, column pruning, and better read performance.
Key design: Buffered schema inference
Because the shredded schema isn't known at Spark's planning time (DSv2 creates DataWriterFactory on the driver before seeing data), the PR uses a lazy/buffered approach:
Shredding heuristics
Co-Authored by: @nssalian