[AMORO-4122][ams] Fix partition filter not working due to @JsonIgnore on PendingInput.partitions#4178
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #4178 +/- ##
============================================
- Coverage 29.72% 22.63% -7.10%
+ Complexity 4252 2612 -1640
============================================
Files 677 461 -216
Lines 54722 42556 -12166
Branches 6964 6003 -961
============================================
- Hits 16268 9632 -6636
+ Misses 37245 32087 -5158
+ Partials 1209 837 -372
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
… on PendingInput.partitions The partitions field in AbstractOptimizingEvaluator.PendingInput was annotated with @JsonIgnore, causing partition information to be lost during JSON serialization/deserialization through TableRuntimeStore. This made the partition filter in OptimizingPlanner always fall back to scanning all partitions. Fix: add a serializable partitionPaths field (Set<String>) that stores partition info as 'specId:path' strings. On serialization, build paths from the in-memory partitions map; on deserialization, rebuild the partitions map from paths using DataFiles.data().
ea2502a to
181f07d
Compare
| struct = TablePropertyUtil.EMPTY_STRUCT; | ||
| } else { | ||
| struct = DataFiles.data(spec, partitionPath); | ||
| } |
There was a problem hiding this comment.
Using org.apache.iceberg.DataFiles.data() here will break partition rebuild for Mixed-format tables. Iceberg's native DataFiles.data() delegates to Conversions.fromPartitionString(), which only handles standard Iceberg type conversions. It does not support Amoro's custom partition transforms:
monthtransform: Mixed-format tables serialize month partitions asyyyy-MM(e.g.,month=2024-01). Iceberg's parser expects an integer month offset, not a date string, and will throwNumberFormatException.hourtransform: Similarly serialized asyyyy-MM-dd-HH, which Iceberg cannot parse.- Hive null sentinel:
__HIVE_DEFAULT_PARTITION__is not handled by Iceberg's parser.
Amoro already provides MixedDataFiles.data(PartitionSpec, String) in org.apache.amoro.utils.MixedDataFiles (line 110) that correctly handles all these cases via its own fromPartitionString().
The catch(Exception) block below will silently swallow these parse failures, logging only a warning while leaving partitions empty — producing the same Expressions.alwaysTrue() fallback as the original bug. This effectively makes the fix a no-op for Mixed-format tables with custom transforms.
Suggested fix:
import org.apache.amoro.utils.MixedDataFiles;
// ...
struct = MixedDataFiles.data(spec, partitionPath);Also, please consider adding test cases for:
- Mixed-format tables with
month/hourpartition transforms (round-trip serialization) - Unpartitioned tables (ensure
EMPTY_STRUCTpath works correctly)
…ion transforms Replace Iceberg's DataFiles.data() with Amoro's MixedDataFiles.data() in rebuildPartitions() to correctly handle Mixed-format custom partition transforms (month, hour) and Hive null sentinel. Co-authored-by: czy006 <czy006@users.noreply.github.com>
Tests for MixedDataFiles.data() round-trip with: - Month partition transform (yyyy-MM format) - Hour partition transform (yyyy-MM-dd-HH format) - Unpartitioned tables (EMPTY_STRUCT path) - Hive null sentinel (__HIVE_DEFAULT_PARTITION__) - Empty PendingInput Co-authored-by: czy006 <czy006@users.noreply.github.com>
|
cc @zhoujinsong |
Why are the changes needed?
Close #4122
The
partitionsfield inAbstractOptimizingEvaluator.PendingInputis annotated with@JsonIgnore, which causes partition information to be silently dropped during JSON serialization/deserialization throughTableRuntimeStore(stored as CLOB in the database).When
PendingInputis read back from the database, thepartitionsmap is always empty. This causes the partition filter inIcebergTableUtil.createOptimizingPlanner()to fall back toExpressions.alwaysTrue(), meaning all partitions are scanned instead of only the ones that need optimizing.Fix
Add a serializable
partitionPathsfield (Set<String>) that stores partition info asspecId:partitionPathstrings. On serialization, build paths from the in-memory partitions map viaPartitionSpec.partitionToPath(); on deserialization, rebuild the partitions map from paths usingDataFiles.data().How was this patch tested?
TestIcebergTableUtil: 4 tests passedTestOptimizingEvaluator: 20 tests passedTestOptimizingPlanner: 20 tests passed