feat(expressions): geometry/geography stats, bounds, and bbox pruning #25

Open: abnobdoss wants to merge 3 commits into main from v3/t3-geo

Conversation

@abnobdoss

Owner

No description provided.

Abanoub Doss added 3 commits June 23, 2026 08:55
… prototype

Revives the standalone core of stale-closed PR apache#3067 and extends it with
REAL manifest-level bbox pruning (the original PR left pruning as no-op
ROWS_MIGHT_MATCH stubs).

- pyiceberg/utils/geospatial.py: WKB/EWKB (ISO + EWKB SRID/Z/M flags, both
  byte orders) envelope extraction, GeospatialBound serde (XY/XYZ/XYM/XYZM),
  antimeridian-aware geography longitude intervals, and a new
  GeospatialStatsAggregator that accumulates min/max bounds over a column on
  write. serialize_geospatial_bound now rejects a genuinely-NaN z with a
  present m (removes XYM sentinel ambiguity).
- pyiceberg/utils/geospatial_pruning.py: bbox_might_match(predicate, query_wkb,
  lower_bound, upper_bound, is_geography) -> bool. Returns False ONLY when it is
  provably impossible for any row in the file to match (no false negatives =
  no silent data loss). Geometry compares planar x intervals; geography uses
  circular longitude-interval overlap, which stays correct at the +-180 seam. A
  conservative epsilon on boundary comparisons absorbs geography circle
  round-trip drift.
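
A minimal sketch of the circular longitude-interval overlap described above, not the code in this PR. The names `lon_width`, `lon_intervals_overlap`, and the tolerance `EPS` are hypothetical, and the sketch assumes an interval is stored as (west, east) in degrees, with west > east meaning it crosses the antimeridian.

```python
EPS = 1e-9  # hypothetical tolerance, chosen only for illustration


def lon_width(west: float, east: float) -> float:
    """Eastward width in degrees of the arc from west to east."""
    if west <= east:
        return east - west
    return east - west + 360.0  # arc crosses the antimeridian


def lon_intervals_overlap(a_west: float, a_east: float,
                          b_west: float, b_east: float,
                          eps: float = EPS) -> bool:
    """True unless the two longitude arcs are provably disjoint.

    Two arcs on a circle overlap exactly when the western end of one
    lies within the other, measured as an eastward offset modulo 360.
    Working with offsets avoids special-casing the +-180 seam.
    """
    a_width = lon_width(a_west, a_east)
    b_width = lon_width(b_west, b_east)
    b_start_in_a = (b_west - a_west) % 360.0 <= a_width + eps
    a_start_in_b = (a_west - b_west) % 360.0 <= b_width + eps
    return b_start_in_a or a_start_in_b


# [170, 180] and [-180, -170] touch at the antimeridian, so they overlap.
print(lon_intervals_overlap(170.0, 180.0, -180.0, -170.0))
# [0, 10] and [20, 30] are disjoint, so a file could be pruned here.
print(lon_intervals_overlap(0.0, 10.0, 20.0, 30.0))
```

Comparing the two intervals as plain numbers would report [170, 180] and [-180, -170] as disjoint, which is the kind of seam error the review flagged.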

Adversarial review caught: (HIGH) +-180 antimeridian boundary wrongly pruned;
(HIGH, found by fuzz) geography circle round-trip drift pruned a file's own
edge points (~21% false-negative rate); (MED) weak negative tests; (MED) XYM
NaN-z sentinel collision. All fixed with regression + property/fuzz tests
asserting zero false negatives over 4000+ random files x 4 predicates.
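
For illustration, a zero-false-negative property test of this kind could look roughly like the sketch below. It is not the test suite from this PR: it uses a simplified planar box predicate, and every name (`bbox_of`, `boxes_overlap`, `count_false_negatives`) is hypothetical.

```python
import random


def bbox_of(points):
    """Bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_overlap(a, b):
    """Conservative pruning check: False only if the boxes are disjoint."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def point_in_box(p, box):
    """Ground truth for a single row against the query box."""
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def count_false_negatives(num_files=4000, seed=0):
    """Count files that were pruned although a row actually matched."""
    rng = random.Random(seed)
    false_negatives = 0
    for _ in range(num_files):
        # A "file" is a handful of random points; its stats are their bbox.
        rows = [(rng.uniform(-100, 100), rng.uniform(-100, 100))
                for _ in range(rng.randint(1, 8))]
        file_box = bbox_of(rows)
        x0, y0 = rng.uniform(-100, 100), rng.uniform(-100, 100)
        query = (x0, y0, x0 + rng.uniform(0, 50), y0 + rng.uniform(0, 50))
        row_matches = any(point_in_box(p, query) for p in rows)
        if row_matches and not boxes_overlap(query, file_box):
            false_negatives += 1
    return false_negatives


print(count_false_negatives())
```

The property being checked is one-directional: pruning may keep a file that has no matching row, but it must never drop a file that has one.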

Standalone: stats/bound/predicate-pruning logic is independent of the T1 write
gate (needed only to persist bounds into a real geo-table commit).

Geography longitude bounds were computed from the unordered set of vertex
longitudes via a minimal enclosing arc, which could wrap the antimeridian
in a way that excluded a real connecting edge. A point lying on an excluded
edge (e.g. POINT(-60 0) on LINESTRING(-120 0, 0 0, 120 0)) was then pruned,
losing matching rows. Compute the longitude span as the union of per-edge
minor arcs between consecutive vertices instead, so connected edges are never
dropped while isolated points and multipoints still wrap correctly.
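
The per-edge approach can be sketched as follows. This is a rough illustration under stated assumptions, not the PR's implementation: the helper names (`minor_arc`, `edge_arcs`, `longitude_span`, `span_contains`) are hypothetical, longitudes are in degrees, and a returned span with west > east is taken to wrap the antimeridian.

```python
def minor_arc(lon_a, lon_b):
    """Shorter arc between two longitudes as (west, eastward_width)."""
    delta = (lon_b - lon_a) % 360.0  # eastward distance from a to b
    if delta <= 180.0:
        return (lon_a, delta)
    return (lon_b, 360.0 - delta)


def edge_arcs(lons):
    """One minor arc per consecutive vertex pair; a lone point is a zero-width arc."""
    if len(lons) == 1:
        return [(lons[0], 0.0)]
    return [minor_arc(a, b) for a, b in zip(lons, lons[1:])]


def to_signed(lon):
    """Map any longitude in degrees into [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0


def longitude_span(arcs):
    """Smallest single arc (west, east) covering every input arc."""
    pieces = []
    for west, width in arcs:
        start = west % 360.0
        end = start + width
        if end > 360.0:  # split arcs that run past 360 into two pieces
            pieces.append((start, 360.0))
            pieces.append((0.0, end - 360.0))
        else:
            pieces.append((start, end))
    pieces.sort()
    gaps = []  # (length, gap_start, gap_end) of uncovered stretches
    covered_end = pieces[0][1]
    for start, end in pieces[1:]:
        if start > covered_end:
            gaps.append((start - covered_end, covered_end, start))
        covered_end = max(covered_end, end)
    wrap_gap = pieces[0][0] + 360.0 - covered_end
    if wrap_gap > 0:
        gaps.append((wrap_gap, covered_end, pieces[0][0] + 360.0))
    if not gaps:
        return (-180.0, 180.0)  # arcs cover the whole circle
    # The span is the complement of the largest uncovered gap.
    _, gap_start, gap_end = max(gaps)
    return (to_signed(gap_end), to_signed(gap_start))


def span_contains(span, lon):
    west, east = span
    if west <= east:
        return west <= lon <= east
    return lon >= west or lon <= east  # span wraps the antimeridian


# LINESTRING(-120 0, 0 0, 120 0): both edges are kept, so -60 stays inside.
line_span = longitude_span(edge_arcs([-120.0, 0.0, 120.0]))
print(line_span, span_contains(line_span, -60.0))
```

Because each edge contributes its own arc, the largest uncovered gap can never fall on top of a real edge, which is what keeps POINT(-60 0) from being pruned in the example from the commit message.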

Also treat any non-finite stored bound coordinate as 'might match' in
bbox pruning, since NaN comparisons would otherwise silently prune to False.
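
A small sketch of why the guard matters, using a simplified planar check rather than the PR's code. The function name `planar_might_intersect` and the tuple layouts are assumptions made for illustration.

```python
import math


def planar_might_intersect(query, lower, upper):
    """Conservative bbox check for one file.

    query is (xmin, ymin, xmax, ymax); lower and upper are the stored
    (x, y) bounds. Returns False only when the boxes are provably disjoint.
    """
    # Any comparison against NaN evaluates to False, so without this guard
    # a NaN bound would make the overlap test fail and the file be pruned.
    if not all(math.isfinite(v) for v in (*lower, *upper)):
        return True
    return (query[0] <= upper[0] and lower[0] <= query[2]
            and query[1] <= upper[1] and lower[1] <= query[3])


nan = float("nan")
print(nan <= 1.0)  # False, which is what would silently prune the file
print(planar_might_intersect((0, 0, 1, 1), (nan, 0.0), (5.0, 5.0)))
```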
@abnobdoss abnobdoss closed this Jun 25, 2026
@abnobdoss abnobdoss reopened this Jun 25, 2026