[VL][Delta] Add JVM Delta DV scan handoff by malinjawi · Pull Request #12131 · apache/gluten

malinjawi · 2026-05-24T06:50:18Z

What changes are proposed in this pull request?

This PR is the next step in the split Delta deletion-vector (DV) stack, following #12040.

It adds the JVM-side Delta DV scan metadata handoff that consumes the native Velox Delta DV reader layer introduced by #12040, without adding DELETE/UPDATE/MERGE DV DML support yet.

Main changes:

derive Delta DV payloads from per-file Delta scan metadata while building Velox split info
materialize Delta DV descriptors on the JVM side through Delta's StoredBitmap / RoaringBitmapArrayFormat.Portable path
pass materialized DV payloads from JVM split planning into native Velox scan execution through external split payload buffers
add Delta 3.3 and Delta 4.0 metadata shims for normalizing DV split metadata
add split payload plumbing through the Velox iterator, JNI, plan evaluator, runtime, and Substrait-to-Velox conversion path
strip Spark's injected DV predicate and internal DV columns only after native DV scan handoff is available
preserve Delta row-index metadata when the plan still needs it for DML-like metrics/row-index consumers
preserve Delta column-mapping behavior for native Delta scans
add fallback/reporting visibility for unsupported Delta DV scan paths
keep unsafe or unsupported paths on Spark rather than silently offloading them
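
For readers new to deletion vectors, the scan-side contract can be sketched as follows: a DV marks file-local row positions as deleted, and the scan must drop exactly those rows. This is a minimal illustration, not code from this PR; `DeletionVectorFilterSketch` and `applyDeletionVector` are hypothetical names, and `java.util.BitSet` stands in for Delta's roaring bitmap format.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustration only: java.util.BitSet stands in for Delta's roaring bitmap. */
final class DeletionVectorFilterSketch {

  /** Returns the rows whose file-local position is NOT marked in the deletion vector. */
  static <T> List<T> applyDeletionVector(List<T> rows, BitSet deletedPositions) {
    List<T> kept = new ArrayList<>();
    for (int pos = 0; pos < rows.size(); pos++) {
      if (!deletedPositions.get(pos)) {
        kept.add(rows.get(pos));
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    BitSet dv = new BitSet();
    dv.set(1); // row position 1 is deleted
    dv.set(3); // row position 3 is deleted
    // prints [a, c, e]
    System.out.println(applyDeletionVector(List.of("a", "b", "c", "d", "e"), dv));
  }
}
```

If both Spark's injected predicate and the native reader applied this filter, the work would be done twice, which is why the predicate is stripped only once the native handoff is in place.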

This PR is intentionally JVM scan handoff only:

no DELETE/UPDATE/MERGE DV DML command implementation yet
no native DV bitmap construction for DML yet
no Delta 2.4 / Spark 3.4 DV scan handoff
no native handoff for spark.databricks.delta.deletionVectors.useMetadataRowIndex=false yet

The no-metadata-row-index path currently falls back because it relies on Spark's row-index filtering contract, especially for DML-created DVs. Native support for that path should be added only once it can be shown to uphold the same correctness contract.

Those pieces will be added in follow-up split PRs.

Related issue: #11901.

How was this patch tested?

Added focused JVM/Velox Delta scan coverage in:

backends-velox/src-delta33/test/scala/org/apache/spark/sql/delta/DeltaDeletionVectorHandoffSuite.scala
backends-velox/src-delta33/test/scala/org/apache/spark/sql/delta/DeltaSuite.scala
backends-velox/src-delta40/test/scala/org/apache/spark/sql/delta/DeltaDeletionVectorHandoffSuite.scala
gluten-delta/src/test/scala/org/apache/gluten/execution/DeltaSuite.scala

Covered cases:

native scan handoff for Delta tables with deletion vectors
materialized DV payload extraction from Delta scan metadata
safe stripping of Spark DV predicates after native payload handoff
partitioned Delta tables with DVs
multiple DV-bearing files
prepared Delta scans / stats-skipping paths
Delta column mapping with DVs
Spark 4 Delta DV scan behavior
fallback correctness when metadata row index is disabled
fallback/reporting visibility for unsupported DV scan paths

Validation run:

local ./dev/format-scala-code.sh check
local ./dev/run-scala-test.sh --force -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -pl backends-velox -s org.apache.gluten.execution.VeloxDeltaSuite (19/19 passed)
local ./dev/run-scala-test.sh --force -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -pl backends-velox -s org.apache.gluten.execution.VeloxDeltaSuite (19/19 passed)
local ./build/mvn -pl gluten-delta -am -Pbackends-velox -Pspark-3.4 -Pdelta -DskipTests compile
local ./build/mvn -pl gluten-delta,backends-velox -am -Pbackends-velox -Pspark-3.5 -Pdelta -DskipTests test-compile
local ./build/mvn -pl gluten-delta,backends-velox -am -Pbackends-velox -Pspark-4.0 -Pscala-2.13 -Pdelta -DskipTests test-compile
local git diff --check upstream/main..HEAD

After addressing review feedback to follow the #10740 split-time handoff shape more closely:

local ./dev/format-scala-code.sh check
local ./build/mvn -pl gluten-delta,backends-velox -am -Pbackends-velox -Pspark-3.5 -Pdelta -DskipTests test-compile
local ./build/mvn -pl gluten-delta,backends-velox -am -Pbackends-velox -Pspark-4.0 -Pscala-2.13 -Pdelta -DskipTests test-compile
local git diff --check

A focused local handoff suite run was attempted, but this machine does not have darwin/aarch64/libgluten.dylib, so the suite aborted during SparkContext startup before executing tests. CI should cover the runtime path.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: IBM BOB

github-actions · 2026-05-24T06:50:46Z

Run Gluten Clickhouse CI on x86

zhztheplayer

Thanks @malinjawi.

zhztheplayer · 2026-05-26T13:28:14Z

+  private lazy val deltaDeletionVectorRegistration
+      : DeltaScanTransformer.DeletionVectorRegistration =
+    DeltaScanTransformer.registerDeletionVectorsFromFileFormat(relation)


Can we follow the approach in the previous attempt #10740 to pass DVs to native? Would the code be simpler that way?

malinjawi

Thanks for pointing that out, @zhztheplayer.

I updated this in d06ec27 to follow the same split-time handoff shape from #10740 more closely.

The PR now removes the driver-side DeltaDeletionVectorRegistry and the relation-wide lazy DV registration from DeltaScanTransformer. DV payloads are materialized from each PartitionedFile's Delta metadata when building split info, then passed to native through the external split payload buffers added in this PR.

I kept the #12040-style external payload channel instead of embedding the serialized bitmap directly into Substrait like #10740 did, because #12040 introduced the native Delta split descriptor path that consumes payload buffers separately. This keeps the code simpler while avoiding large binary DV payloads in the Substrait plan.
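
The difference between embedding the bitmap in the plan and using a separate payload channel can be sketched roughly as below. This is an assumption-laden illustration, not the PR's code: `ExternalPayloadChannelSketch`, `SplitDescriptor`, `SplitPlanner`, and `payloadIndex` are all hypothetical names. The point is only that the descriptor stays small while the serialized DV bytes travel in side buffers.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: the plan carries a small descriptor, the bytes travel separately. */
final class ExternalPayloadChannelSketch {

  /** Small descriptor that would travel inside the plan; it carries no payload bytes. */
  static final class SplitDescriptor {
    final String filePath;
    final int payloadIndex; // -1 means "this file has no deletion vector"

    SplitDescriptor(String filePath, int payloadIndex) {
      this.filePath = filePath;
      this.payloadIndex = payloadIndex;
    }
  }

  /** Collects payloads in a side channel and hands out descriptors that point into it. */
  static final class SplitPlanner {
    private final List<ByteBuffer> payloads = new ArrayList<>();

    SplitDescriptor addSplit(String filePath, byte[] serializedDv) {
      if (serializedDv == null) {
        return new SplitDescriptor(filePath, -1);
      }
      // Direct buffers live outside the Java heap, so native code can view them in place.
      ByteBuffer direct = ByteBuffer.allocateDirect(serializedDv.length);
      direct.put(serializedDv);
      direct.flip();
      payloads.add(direct);
      return new SplitDescriptor(filePath, payloads.size() - 1);
    }

    ByteBuffer[] payloadBuffers() {
      return payloads.toArray(new ByteBuffer[0]);
    }
  }

  public static void main(String[] args) {
    SplitPlanner planner = new SplitPlanner();
    SplitDescriptor plain = planner.addSplit("part-0.parquet", null);
    SplitDescriptor withDv = planner.addSplit("part-1.parquet", new byte[] {1, 2, 3});
    System.out.println(plain.payloadIndex + " " + withDv.payloadIndex
        + " " + planner.payloadBuffers().length);
  }
}
```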

zhztheplayer · 2026-05-27T10:37:40Z

+import org.apache.spark.sql.delta.sources.DeltaSQLConf
+
+/** Shadow Delta's PrepareDeltaScan to inject backend-specific DV preprocessing. */
+class PrepareDeltaScan(protected val spark: SparkSession)


Would you explain a bit what this rule is doing and why it's needed?

malinjawi

Thanks @zhztheplayer. This is a Delta 3.3 compatibility shim around Delta's PrepareDeltaScan.

It keeps Delta's normal scan preparation first, including stats skipping, metadata-query optimization, and transaction read tracking. After that, Gluten runs DV preprocessing so the scan exposes Delta's internal DV metadata/row-deleted column.

Gluten needs that metadata to materialize the per-file DV payload for the native split. Later, once the native split has the payload, Gluten strips Spark's synthetic DV predicate/internal columns so Velox applies the DV filter natively and we do not filter twice.

I can expand the comment to make this flow clear.

zhztheplayer · 2026-05-27T10:40:15Z

+    this(runtime, iterHandle, null);
+  }
+
+  public ColumnarBatchOutIterator(Runtime runtime, long iterHandle, Object retainedReference) {


What is retainedReference for?

malinjawi

Yes, that name was unclear, so I renamed it to retainedSplitPayloadBuffers and typed it as ByteBuffer[][].

It keeps the Java-owned direct buffers reachable while Velox holds native views into them, so the payload memory cannot be GC’d during scan execution.
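
The retention idea can be sketched as follows. This is a simplified, hypothetical stand-in for the iterator (the class name `PayloadRetentionSketch` and the `isRetaining` helper are assumptions; only the field name `retainedSplitPayloadBuffers` and its `ByteBuffer[][]` type come from the discussion above). Holding a strong reference for the iterator's lifetime keeps the direct buffers reachable, so their native memory is not released while native code may still read it.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for an iterator that owns split payload buffers. */
final class PayloadRetentionSketch implements AutoCloseable {

  // Strong reference: while this object is reachable, the direct buffers are too.
  private ByteBuffer[][] retainedSplitPayloadBuffers;

  PayloadRetentionSketch(ByteBuffer[][] retainedSplitPayloadBuffers) {
    this.retainedSplitPayloadBuffers = retainedSplitPayloadBuffers;
  }

  boolean isRetaining() {
    return retainedSplitPayloadBuffers != null;
  }

  @Override
  public void close() {
    // A real iterator would release its native handle first (not shown here),
    // and only then drop the reference so the buffers become collectable.
    retainedSplitPayloadBuffers = null;
  }

  public static void main(String[] args) {
    ByteBuffer[][] payloads = {{ByteBuffer.allocateDirect(16)}};
    try (PayloadRetentionSketch iter = new PayloadRetentionSketch(payloads)) {
      System.out.println(iter.isRetaining());
    }
  }
}
```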

zhztheplayer · 2026-05-27T10:43:20Z

+  /// Optional externally provided deletion vector payloads aligned with metadataColumns.
+  std::vector<std::optional<SplitPayloadBufferView>> deletionVectorPayloads;
+


Can we add DeltaSplitInfo : SplitInfo?

malinjawi

Addressed. I added DeltaSplitInfo on the Velox side and moved the Delta DV split state there instead of keeping it on the generic split metadata path.

zhztheplayer · 2026-05-27T10:44:18Z

      metadataColumnMap[metadataColumn.key()] = metadataColumn.value();
    }
+    for (const auto& otherMetadataColumn : file.other_const_metadata_columns()) {
+      if (auto unpackedValue = unpackMetadataValue(otherMetadataColumn.value())) {


Can we extend the Substrait proto to avoid placing the essential info in otherMetadataColumn?

malinjawi

Got it. I extended the Substrait read options with typed Delta DV fields and stopped carrying this essential Delta metadata through other_const_metadata_columns.

github-actions Bot added the CORE, VELOX, and DATA_LAKE labels May 24, 2026

malinjawi force-pushed the split/delta-dv-java-scan-pr branch from af4d0fb to b0da63a May 24, 2026 07:13

github-actions Bot added the DOCS label May 24, 2026

[VL][Delta] Add JVM Delta DV scan handoff

b8f932c

malinjawi force-pushed the split/delta-dv-java-scan-pr branch from b0da63a to b8f932c May 24, 2026 07:15

github-actions Bot removed the DOCS label May 24, 2026

[VL][Delta] Fix DV scan CI coverage

d8c3f61

[VL][Delta] Fix Scala format

f30d47a

[VL][Delta] Fix DV scan CI fallbacks

d142569

zhztheplayer reviewed May 26, 2026

[VL][Delta] Simplify DV scan payload handoff

c6f05f0

malinjawi force-pushed the split/delta-dv-java-scan-pr branch from d06ec27 to c6f05f0 May 26, 2026 17:39

Merge branch 'main' into split/delta-dv-java-scan-pr

4f2147a

malinjawi requested a review from zhztheplayer May 26, 2026 22:47

zhztheplayer reviewed May 27, 2026

[VL][Delta] Type DV scan split metadata

fa2833b

Merge branch 'main' into split/delta-dv-java-scan-pr

5cf3b66

[VL][Delta] Fix DV split metadata serialization

32af86b

Merge branch 'main' into split/delta-dv-java-scan-pr

7afb74c

malinjawi requested a review from zhztheplayer May 27, 2026 18:09
