
[GLUTEN-12013][VL] Fix bloom-filter bytes corruption on whole-stage AQE fallback #12151

Open
brijrajk wants to merge 1 commit into apache:main from brijrajk:fix/12013-bloom-filter-stage-fallback

@brijrajk commented May 27, 2026

What changes are proposed in this pull request?

Fixes #12013

Root cause

When ExpandFallbackPolicy triggers a whole-stage AQE fallback, it reverts to the plan captured before HeuristicTransform runs (i.e. before all pre-transform rewrites). As a result, the substitution performed by BloomFilterMightContainJointRewriteRule, which replaces vanilla Spark's BloomFilterMightContain with VeloxBloomFilterMightContain, is silently undone in the fallback plan.

If Stage 0 (the bloom_filter_agg subquery) has already executed natively, it has produced Velox-format bloom filter bytes. The vanilla BloomFilterMightContain in the fallen-back filter stage then calls BloomFilterImpl.readFrom() on those bytes, which throws:

java.io.IOException: Unexpected Bloom filter version number (16777217)

or causes a native assertion failure (kBloomFilterV1 == version) during the merge phase.
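
The odd version number is consistent with a header-layout mismatch: 16777217 is 0x01000001, which is what a big-endian 32-bit read returns when the first bytes are 01 00 00 01. The sketch below illustrates this under an assumed native-style layout (a 1-byte version tag followed by a little-endian 32-bit size of 65536). The layout, the size value, and the names `BloomFilterHeaderDemo`, `nativeStyleHeader`, and `readVersionAsJvm` are hypothetical and used only for illustration; they are not taken from the Velox or Spark sources.

```scala
import java.io.{ByteArrayInputStream, DataInputStream}
import java.nio.{ByteBuffer, ByteOrder}

object BloomFilterHeaderDemo {
  // ASSUMPTION: a native-style header made of a 1-byte version tag followed
  // by a little-endian 32-bit size. This is an illustrative layout only.
  def nativeStyleHeader(version: Byte, size: Int): Array[Byte] = {
    val buf = ByteBuffer.allocate(5).order(ByteOrder.LITTLE_ENDIAN)
    buf.put(version)
    buf.putInt(size)
    buf.array()
  }

  // Interprets the first four bytes as a big-endian int, which is how
  // java.io.DataInputStream.readInt decodes its input.
  def readVersionAsJvm(bytes: Array[Byte]): Int =
    new DataInputStream(new ByteArrayInputStream(bytes)).readInt()

  def main(args: Array[String]): Unit = {
    val header = nativeStyleHeader(1.toByte, 65536)
    // Bytes are 01 00 00 01 00, so the JVM-side "version" is 0x01000001.
    println(readVersionAsJvm(header))
  }
}
```

Under these assumptions the JVM reader sees 16777217 instead of the expected version 1, which matches the number in the exception message.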

Fix

Register BloomFilterMightContainFallbackPatcher as a second fallback-policy pass (after ExpandFallbackPolicy) in VeloxRuleApi. The patcher walks the subtree of every FallbackNode and replaces any remaining BloomFilterMightContain inside FilterExec nodes with VeloxBloomFilterMightContain, so the JVM filter can continue to read Velox-format bytes via JNI even after the whole-stage fallback.

The patcher is guarded by requireBloomFilterAggMightContainJointFallback() so it is a no-op for backends that do not require joint fallback (e.g. ClickHouse).
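
The shape of the rewrite can be sketched on a toy plan model. This is not the code in BloomFilterMightContainFallbackPatcher.scala: the `Expr` and `Plan` types, the `Fallback` marker, and the `jointFallbackRequired` flag below are hypothetical stand-ins for Spark's expression and plan classes, FallbackNode, and requireBloomFilterAggMightContainJointFallback().

```scala
object FallbackPatcherSketch {
  // Toy expression model; NOT Spark's or Gluten's classes.
  sealed trait Expr
  case class MightContain(filter: String, value: String) extends Expr
  case class VeloxMightContain(filter: String, value: String) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Other(name: String) extends Expr

  // Toy plan model with a marker node standing in for a fallback subtree.
  sealed trait Plan
  case class Filter(condition: Expr, child: Plan) extends Plan
  case class Scan(table: String) extends Plan
  case class Fallback(child: Plan) extends Plan

  // Replace every vanilla might-contain expression with the Velox-aware one.
  def patchExpr(e: Expr): Expr = e match {
    case MightContain(f, v) => VeloxMightContain(f, v)
    case And(l, r)          => And(patchExpr(l), patchExpr(r))
    case other              => other
  }

  // Rewrite Filter conditions only underneath a Fallback marker.
  // When the backend does not need joint fallback, return the plan unchanged.
  def patch(plan: Plan, jointFallbackRequired: Boolean, inFallback: Boolean = false): Plan =
    if (!jointFallbackRequired) plan
    else plan match {
      case Fallback(c) =>
        Fallback(patch(c, jointFallbackRequired, inFallback = true))
      case Filter(cond, c) =>
        val newCond = if (inFallback) patchExpr(cond) else cond
        Filter(newCond, patch(c, jointFallbackRequired, inFallback))
      case s: Scan => s
    }

  def main(args: Array[String]): Unit = {
    val plan = Fallback(Filter(And(MightContain("bf", "x"), Other("y > 1")), Scan("t")))
    println(patch(plan, jointFallbackRequired = true))
    println(patch(plan, jointFallbackRequired = false))
  }
}
```

In this sketch, filters outside a fallback subtree are left alone, mirroring the description that only FilterExec nodes under a FallbackNode are patched.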

Files changed

  • BloomFilterMightContainFallbackPatcher.scala — New Rule[SparkPlan] that patches fallback plans
  • VeloxRuleApi.scala — Registers the patcher as a second fallback-policy pass
  • GlutenBloomFilterAggregateQuerySuite.scala — Regression test for the exact failure scenario

How was this patch tested?

A regression test "Test bloom_filter_agg whole-stage fallback does not corrupt bloom filter bytes" was added to GlutenBloomFilterAggregateQuerySuite (tagged Issue12013).

The test reproduces the precise failure path:

  • COLUMNAR_FILTER_ENABLED = false — forces FilterExec to fall back (net transition cost = 2)
  • COLUMNAR_WHOLESTAGE_FALLBACK_THRESHOLD = 2 — only the filter stage triggers whole-stage fallback via ExpandFallbackPolicy; the bloom_filter_agg subquery stages (inherent cost = 1 < threshold) continue to run natively and emit Velox-format bytes
  • ANSI_ENABLED = false — Spark 4.0 enables ANSI by default, which causes ObjectHashAggregateExec to fail Gluten validation and raises the agg-stage transition cost above 1; disabling ANSI keeps the agg cost at 1 so only the filter falls back as intended
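
The interaction between these three settings can be modelled with a small sketch. It assumes the decision rule is "a stage falls back when its transition cost is at least the threshold", which is inferred from the costs quoted above and is not taken from the ExpandFallbackPolicy source. The `Stage` type, the `fallsBack` helper, and the cost of 2 used for the ANSI-enabled aggregate are hypothetical.

```scala
object FallbackThresholdSketch {
  // Toy stage descriptor; the cost values mirror the ones quoted in the test notes.
  case class Stage(name: String, transitionCost: Int)

  // ASSUMED rule: whole-stage fallback triggers when cost >= threshold.
  def fallsBack(stage: Stage, threshold: Int): Boolean =
    stage.transitionCost >= threshold

  def main(args: Array[String]): Unit = {
    val threshold = 2
    val filterStage = Stage("filter", 2)         // filter offload disabled
    val aggStage = Stage("bloom_filter_agg", 1)  // ANSI disabled
    val aggStageAnsi = Stage("bloom_filter_agg", 2) // hypothetical cost with ANSI enabled

    for (s <- Seq(filterStage, aggStage, aggStageAnsi))
      println(s"${s.name} (cost ${s.transitionCost}) falls back: ${fallsBack(s, threshold)}")
  }
}
```

Under this model only the filter stage crosses the threshold when ANSI is disabled, which is the split the regression test depends on: a native aggregate feeding a JVM filter.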

Without the fix the test fails with IOException: Unexpected Bloom filter version number (16777217). With the fix all 200,003 rows are returned correctly.

The test was run inside the gluten-dev Docker container against the gluten-ut/spark40 module:

Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-sonnet-4-6)

@github-actions bot added the CORE (works for Gluten Core) and VELOX labels May 27, 2026
@brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch from 6cbe5c1 to 4a56662 May 27, 2026 11:30
…QE fallback

When ExpandFallbackPolicy triggers a whole-stage AQE fallback it reinstates
the plan from before HeuristicTransform (i.e. before pre-transform rewrites),
so BloomFilterMightContainJointRewriteRule's substitution of
BloomFilterMightContain -> VeloxBloomFilterMightContain is lost. If Stage 0
(bloom_filter_agg subquery) already executed natively it produced Velox-format
bytes; BloomFilterMightContain then calls BloomFilterImpl.readFrom() on those
bytes and throws:
  java.io.IOException: Unexpected Bloom filter version number (16777217)
or the native assertion kBloomFilterV1 == version fires during merge.

Fix: register BloomFilterMightContainFallbackPatcher as a second fallback-policy
pass (after ExpandFallbackPolicy). It walks FallbackNode subtrees and replaces
any remaining BloomFilterMightContain inside FilterExec with
VeloxBloomFilterMightContain, so the JVM filter can read Velox-format bytes via
JNI even after falling back to the JVM execution path.

The patcher is guarded by requireBloomFilterAggMightContainJointFallback() so
it is a no-op for backends that do not require joint fallback (e.g. ClickHouse).

A regression test is added to GlutenBloomFilterAggregateQuerySuite that
reproduces the failure path:
  - COLUMNAR_FILTER_ENABLED=false  -> FilterExec falls back (net cost 2)
  - WHOLESTAGE_FALLBACK_THRESHOLD=2 -> only filter stage falls back; agg runs
                                       natively and emits Velox bytes
  - ANSI_ENABLED=false             -> prevents agg validation failure on Spark 4.0
                                       which would raise agg-stage cost above 1

Fixes apache#12013

Generated-by: Claude Code (claude-sonnet-4-6)
@brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch from 4a56662 to 9bf19dc May 27, 2026 11:38
@github-actions

Run Gluten Clickhouse CI on x86

@brijrajk
Author

Could a maintainer please remove the CORE label? All three changed files are Velox-backend-specific (under backends-velox/ and gluten-ut/spark40/), and no common core code is touched. Only the VELOX label is correct. Thanks!

Labels

CORE works for Gluten Core VELOX

Development

Successfully merging this pull request may close these issues.

Fail to read the native bloom_filter when the stage fallback to java
