[SPARK-55729][SS] Support state data source reader for new state format v4 on stream-stream join by HeartSaVioR · Pull Request #54845 · apache/spark

HeartSaVioR · 2026-03-17T02:12:09Z

What changes were proposed in this pull request?

This PR adds support for reading state through the state data source reader for the new state format v4 of stream-stream join.

The state data source reader supports two options, joinSide and storeName, for reading the state of the stream-stream join operator. This PR enables both options for format v4. The only difference is that the store names differ in v4, although we expect most users to use only the joinSide option; reading state rows by storeName should only be needed for debugging.

Why are the changes needed?

The state data source reader did not support reading state written with the new state format v4 in stream-stream join. This PR adds that support.

Does this PR introduce any user-facing change?

Yes. The state data source reader will be able to read state written with the new state format v4 in stream-stream join. In terms of UX, there is no difference from older state format versions.

How was this patch tested?

Existing UTs are expanded to cover state format version 4.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude 4.6 opus

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala

anishshri-db · 2026-03-21T00:13:16Z

.../scala/org/apache/spark/sql/execution/datasources/v2/state/StreamStreamJoinStateHelper.scala

-        val vSchema = manager.readSchemaFile().find { schema =>
-          schema.colFamilyName == storeNames(1)
-        }.map(_.valueSchema).get
+        // Try v3 CF names first; if not found, use v4 CF names


Why so? We have the format version in the offset log, right?

anishshri-db · 2026-03-21T00:14:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SchemaHelper.scala

        case 1 if Utils.isTesting => new SchemaV1Writer
        case 2 => new SchemaV2Writer
-        case 3 => new SchemaV3Writer
+        case v if v >= 3 => new SchemaV3Writer


nit: maybe just be explicit about the supported versions and throw an error for anything > 4?
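A minimal sketch of what the suggestion above could look like, assuming v4 reuses the v3 schema writer as the diff implies. The trait and classes here are simplified stand-ins, not the real definitions in SchemaHelper.scala, and the testing-only version 1 branch is left out.

```scala
// Hypothetical stand-ins for the real schema writer classes; only the
// version dispatch is the point of this sketch.
sealed trait SchemaWriter { def version: Int }
final class SchemaV2Writer extends SchemaWriter { val version = 2 }
final class SchemaV3Writer extends SchemaWriter { val version = 3 }

object SchemaWriterSketch {
  // Supported versions are listed explicitly. Version 4 is assumed to share
  // the v3 writer; anything else fails fast instead of silently falling through.
  def createWriter(formatVersion: Int): SchemaWriter = formatVersion match {
    case 2 => new SchemaV2Writer
    case 3 | 4 => new SchemaV3Writer
    case other =>
      throw new IllegalArgumentException(
        s"Unsupported schema format version: $other")
  }
}
```

Compared with `case v if v >= 3`, an unknown future version is rejected here rather than being mapped to the v3 writer by accident.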

anishshri-db · 2026-03-21T00:15:31Z

...est/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSourceReadSuite.scala

    testStreamStreamJoin(3)
  }

+  test("stream-stream join, state ver 4") {


Do we also have a test for the change tracking option?

HeartSaVioR · 2026-03-23T21:43:52Z

The CI failure seems unrelated:
https://github.com/HeartSaVioR/spark/actions/runs/23436596292/job/68176309806

...org/apache/spark/sql/execution/datasources/v2/state/StateDataSourceChangeDataReadSuite.scala

dylanwong250 · 2026-03-23T21:51:28Z

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala

-    val useMultipleValuesPerKey = SchemaUtil.checkVariableType(stateVariableInfoOpt,
-      StateVariableType.ListState)
+    val useMultipleValuesPerKey = StatePartitionReaderUtils.isMultiValuedCF(
+      joinColFamilyOpt.getOrElse(""), stateVariableInfoOpt)


I find the .getOrElse("") a bit confusing. Maybe we can do something like joinColFamilyOpt.exists(StatePartitionReaderUtils.v4JoinCFNames.contains) or have isMultiValuedCF take an Option[String].

Done - 1f1fd48
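For illustration, a self-contained sketch of the Option-based check discussed above. The column family names are placeholders, and the rule that a column family is multi-valued when it is a v4 join column family or backs a list state is an assumption; the actual logic in commit 1f1fd48 may differ.

```scala
object MultiValuedCfSketch {
  // Placeholder names: the real v4 join column family names are not shown
  // in this thread.
  val v4JoinCFNames: Set[String] = Set("exampleLeftCF", "exampleRightCF")

  // Takes the Option directly, so no empty-string sentinel is needed.
  // `isListState` stands in for the real check on the state variable info.
  def isMultiValuedCF(
      joinColFamilyOpt: Option[String],
      isListState: Boolean): Boolean = {
    joinColFamilyOpt.exists(v4JoinCFNames.contains) || isListState
  }
}
```

With `Option.exists`, a missing column family name simply yields false for the join part of the check, which makes the intent clearer than comparing against `""`.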

dylanwong250 · 2026-03-23T21:59:03Z

.../scala/org/apache/spark/sql/execution/datasources/v2/state/StreamStreamJoinStateHelper.scala

+        val schemas = manager.readSchemaFile()
+
+        val kSchema = schemas.find(_.colFamilyName == v4Names(0)).map(_.keySchema).get
+        val vSchema = schemas.find(_.colFamilyName == v4Names(0)).map(_.valueSchema).get


I see v3 uses storeNames(1) for vSchema. This being v4Names(0) is a bit confusing. Maybe extract into a val and add a comment so it is easier to understand.

Done - 1f1fd48
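A rough sketch of the refactoring suggested above: look the column family up once, bind it to a named val, and document why both schemas come from the same entry. The case class is a simplified stand-in with string schemas, and the claim that one v4 column family carries both schemas is inferred from the diff rather than confirmed.

```scala
// Simplified stand-in for an entry of the schema file; the real entries
// hold StructType schemas rather than strings.
final case class ColFamilySchema(
    colFamilyName: String,
    keySchema: String,
    valueSchema: String)

object V4SchemaLookupSketch {
  def keyAndValueSchema(
      schemas: Seq[ColFamilySchema],
      v4Names: Seq[String]): (String, String) = {
    // Assumption: in v4 the first column family carries both the key and the
    // value schema, unlike v3 where the value schema is read from storeNames(1).
    val primaryCF = schemas
      .find(_.colFamilyName == v4Names.head)
      .getOrElse(throw new NoSuchElementException(
        s"Column family ${v4Names.head} not found in schema file"))
    (primaryCF.keySchema, primaryCF.valueSchema)
  }
}
```

This also avoids scanning the schema list twice and replaces the bare `.get` with an error that names the missing column family.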

...est/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSourceReadSuite.scala

...g/apache/spark/sql/execution/datasources/v2/state/StreamStreamJoinStatePartitionReader.scala

HeartSaVioR · 2026-03-24T13:12:45Z

https://github.com/HeartSaVioR/spark/runs/68291049885
Again, the protobuf breaking-change CI failure.

dylanwong250

Thanks!

HeartSaVioR added 5 commits March 18, 2026 11:12

WIP support state data source reader for stream-stream join v4

75ab3d1

reflect review comments (by claude, need another confirm)

address review comments (from Claude)

7a7c661

update todo message

1bf8604

remove TODO comment - we already have e2e test in other place

b2b29d3

fix tests

3902f7b

HeartSaVioR force-pushed the SPARK-55729 branch from 19a0871 to 3902f7b on March 18, 2026 02:12

anishshri-db reviewed Mar 21, 2026

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala

anishshri-db reviewed Mar 21, 2026

HeartSaVioR added 2 commits March 21, 2026 22:43

reflect review comments

7642952

fix test

678582c

anishshri-db reviewed Mar 23, 2026

...org/apache/spark/sql/execution/datasources/v2/state/StateDataSourceChangeDataReadSuite.scala

dylanwong250 reviewed Mar 23, 2026

HeartSaVioR added 2 commits March 24, 2026 11:17

Address review comment from Anish

e4f13b0

Address Dylan's review comments

1f1fd48

HeartSaVioR requested review from anishshri-db and dylanwong250 March 24, 2026 13:11

anishshri-db approved these changes Mar 24, 2026

dylanwong250 approved these changes Mar 24, 2026

HeartSaVioR wants to merge 9 commits into apache:master from HeartSaVioR:SPARK-55729

3 participants
