added 3 commits
June 23, 2026 08:55
… prototype

Revives the standalone core of stale-closed PR apache#3067 and extends it with REAL manifest-level bbox pruning (the original PR left pruning as no-op ROWS_MIGHT_MATCH stubs).

- pyiceberg/utils/geospatial.py: WKB/EWKB (ISO + EWKB SRID/Z/M flags, both byte orders) envelope extraction, GeospatialBound serde (XY/XYZ/XYM/XYZM), antimeridian-aware geography longitude intervals, and a new GeospatialStatsAggregator that accumulates min/max bounds over a column on write. serialize_geospatial_bound now rejects a genuinely-NaN z with a present m (removes XYM sentinel ambiguity).
- pyiceberg/utils/geospatial_pruning.py: bbox_might_match(predicate, query_wkb, lower_bound, upper_bound, is_geography) -> bool. Returns False ONLY when it is provably impossible for any row in the file to match (no false negatives = no silent data loss). Geometry uses planar x; geography uses circular longitude-interval overlap correct at the +-180 seam. Conservative epsilon on boundary comparisons absorbs geography circle round-trip drift.

Adversarial review caught:
- (HIGH) +-180 antimeridian boundary wrongly pruned;
- (HIGH, found by fuzz) geography circle round-trip drift pruned a file's own edge points (~21% false-negative rate);
- (MED) weak negative tests;
- (MED) XYM NaN-z sentinel collision.

All fixed with regression + property/fuzz tests asserting zero false negatives over 4000+ random files x 4 predicates.

Standalone: stats/bound/predicate-pruning logic is independent of the T1 write gate (needed only to persist bounds into a real geo-table commit).
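The circular longitude-interval overlap described above can be illustrated with a minimal sketch. This is not the code from pyiceberg/utils/geospatial_pruning.py: the function name `lon_intervals_might_overlap`, the tolerance value `EPS`, and the convention that `west > east` means the interval crosses the antimeridian are all assumptions made for illustration.

```python
EPS = 1e-9  # hypothetical tolerance; the PR does not state the value it uses


def _width(west: float, east: float) -> float:
    """Eastward width in degrees; west > east is assumed to mean 'wraps the seam'."""
    return east - west if west <= east else east - west + 360.0


def lon_intervals_might_overlap(
    a_west: float, a_east: float, b_west: float, b_east: float, eps: float = EPS
) -> bool:
    """Return False only if the two longitude intervals are provably disjoint."""
    wa = _width(a_west, a_east)
    wb = _width(b_west, b_east)
    # Eastward offset from A's west edge to B's west edge, in [0, 360).
    d = (b_west - a_west) % 360.0
    # Either B starts inside A, or A starts inside B. Working modulo 360
    # makes +180 and -180 the same meridian, and eps keeps boundary
    # comparisons conservative (pruning less, never more).
    return d <= wa + eps or (360.0 - d) <= wb + eps
```

Under these assumptions, a file bounded by [170, 180] and a query bounded by [-180, -170] touch at the seam and are reported as possibly overlapping, which is the case the review flagged as wrongly pruned. A full pruning check would also need a latitude comparison, omitted here.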
Geography longitude bounds were computed from the unordered set of vertex longitudes via a minimal enclosing arc, which could wrap the antimeridian in a way that excluded a real connecting edge. A point lying on an excluded edge (e.g. POINT(-60 0) on LINESTRING(-120 0, 0 0, 120 0)) was then pruned, losing matching rows.

Compute the longitude span as the union of per-edge minor arcs between consecutive vertices instead, so connected edges are never dropped while isolated points and multipoints still wrap correctly.

Also treat any non-finite stored bound coordinate as 'might match' in bbox pruning, since NaN comparisons would otherwise silently prune to False.
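A minimal sketch of the per-edge approach for a single connected vertex sequence follows. The function name `edge_union_lon_span` and its return convention (a `(west, east)` pair where `west > east` means the span crosses the antimeridian) are hypothetical, not taken from the commit. The sketch covers only the connected-edge case; the handling of isolated points and multipoints mentioned above is not shown.

```python
from typing import Sequence, Tuple


def _wrap(lon: float) -> float:
    """Map a longitude in degrees into [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


def edge_union_lon_span(lons: Sequence[float]) -> Tuple[float, float]:
    """Longitude span covered by the minor arcs between consecutive vertices.

    Walks the vertices in order, adding the signed minor-arc step of each
    edge to an unwrapped running longitude. Because the edges are connected,
    the union of their arcs is the range of that running value.
    """
    pos = float(lons[0])
    lo = hi = pos
    for prev, cur in zip(lons, lons[1:]):
        # Signed shortest step from prev to cur, in [-180, 180).
        # Vertices exactly 180 degrees apart are ambiguous; this picks -180.
        step = (cur - prev + 180.0) % 360.0 - 180.0
        pos += step
        lo = min(lo, pos)
        hi = max(hi, pos)
    if hi - lo >= 360.0:
        return (-180.0, 180.0)  # edges cover the whole circle
    return (_wrap(lo), _wrap(hi))
```

For the example in the commit message, the vertex longitudes -120, 0, 120 yield the span (-120, 120), which contains -60, so POINT(-60 0) would no longer be excluded. A short segment from 170 to -170 yields (170, -170), a span that crosses the antimeridian rather than sweeping the long way around.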