[SPARK-55121][PYTHON][SS] Add DataStreamReader.name() to Classic PySpark #53898
Conversation
JIRA Issue Information: Task SPARK-55121 (this comment was automatically generated by GitHub Actions)
Force-pushed from 02fa47f to df74e5f.
HyukjinKwon left a comment:
I am fine with this but I think it would be better if @HeartSaVioR has some time to take a look.
anishshri-db left a comment:
lgtm pending green CI
gaogaotiantian left a comment:
Some final minor suggestions :)
### What changes were proposed in this pull request?
This PR adds the `name()` method to Classic PySpark's `DataStreamReader` class. This method allows users to specify a name for streaming sources, which is used in checkpoint metadata and enables stable checkpoint locations for source evolution.
Changes include:
- Add `name()` method to `DataStreamReader` in `python/pyspark/sql/streaming/readwriter.py`
- Add comprehensive test suite in `python/pyspark/sql/tests/streaming/test_streaming_reader_name.py`
- Update compatibility test to mark `name` as currently missing from Connect (until the Connect PR merges)
The method validates that the `source_name` contains only ASCII letters, digits, and underscores, raising `PySparkTypeError` or `PySparkValueError` for invalid inputs.
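For illustration, here is a minimal sketch of the validation described above. The helper name, the exact messages, and the pattern constant are assumptions for this sketch; the actual implementation lives in `python/pyspark/sql/streaming/readwriter.py` and uses PySpark's error classes (e.g. `VALUE_NOT_NON_EMPTY_STR`, per the commit notes further down):

```python
import re

from pyspark.errors import PySparkTypeError, PySparkValueError

# Illustrative sketch only; plain messages are used here to keep it
# self-contained, whereas the real method raises structured error classes.
_SOURCE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def validate_source_name(source_name):
    """Validate a streaming source name as described above."""
    # Type is checked first so a non-str value gets a type error rather than
    # a confusing "invalid name" error.
    if not isinstance(source_name, str):
        raise PySparkTypeError(
            f"source_name must be a str, got {type(source_name).__name__}"
        )
    # fullmatch with "+" also rejects empty strings, and whitespace is not in
    # the character class, so whitespace-only names fail too.
    if not _SOURCE_NAME_PATTERN.fullmatch(source_name):
        raise PySparkValueError(
            f"source_name must contain only ASCII letters, digits, and "
            f"underscores; got {source_name!r}"
        )
    return source_name
```

In the Classic reader, the actual `name()` method would additionally forward the validated name to the underlying JVM reader and return `self` so it can be chained with `format()`, `option()`, and `load()`.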
### Why are the changes needed?
This brings Classic PySpark to feature parity with the Scala/Java API for streaming source naming. The `name()` method is essential for:
1. Identifying sources in checkpoint metadata
2. Enabling stable checkpoint locations during source evolution
3. Providing consistency across Classic and Connect implementations
### Does this PR introduce _any_ user-facing change?
Yes. Users can now call `.name()` on `DataStreamReader` in Classic PySpark:
```python
spark.readStream.format("parquet").name("my_source").load("/path")
```
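As a complementary, hypothetical example (assuming an active `spark` session as above), an invalid name is rejected by the validation described earlier:

```python
from pyspark.errors import PySparkValueError

try:
    # "my-source" contains a hyphen, which is outside [A-Za-z0-9_].
    spark.readStream.format("parquet").name("my-source").load("/path")
except PySparkValueError as err:
    print("rejected:", err)
```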
### How was this patch tested?
- Added comprehensive unit tests in `test_streaming_reader_name.py` covering:
- Valid name patterns (letters, digits, underscores)
- Invalid names (hyphens, spaces, dots, special characters, empty strings, None, wrong types); see the sketch after this list
- Method chaining
- Different data formats (parquet, json)
- Integration with streaming queries
- Updated compatibility tests to account for the current state where Classic has `name` but Connect doesn't yet
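A condensed, hypothetical sketch of that invalid-name coverage using `subTest` (the real suite in `test_streaming_reader_name.py` is more thorough; the class and fixture names here are assumptions):

```python
import unittest

from pyspark.errors import PySparkTypeError, PySparkValueError
from pyspark.sql import SparkSession


class StreamingReaderNameSketch(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_invalid_names(self):
        # Hyphens, spaces, dots, special characters, and empty strings all
        # fall outside [A-Za-z0-9_]+ and should be rejected.
        for bad in ["my-source", "my source", "my.source", "name$", "name#", "name!", ""]:
            with self.subTest(name=bad):
                with self.assertRaises(PySparkValueError):
                    self.spark.readStream.format("parquet").name(bad)

    def test_wrong_types(self):
        # Non-string values should fail the type check before the name check.
        for bad in [None, 123, ["x"]]:
            with self.subTest(name=bad):
                with self.assertRaises(PySparkTypeError):
                    self.spark.readStream.format("parquet").name(bad)


if __name__ == "__main__":
    unittest.main()
```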
### Was this patch authored or co-authored using generative AI tooling?
Yes.
- Fix empty string validation to use VALUE_NOT_NON_EMPTY_STR error
- Check type before checking emptiness to avoid confusing error messages
- Combine all invalid name tests into single test with subTests
- Use PySparkValueError instead of generic Exception for invalid names
- Add more invalid name test cases (dollar, hash, exclamation)

- Remove redundant empty string validation check: empty and whitespace-only strings are now caught by the regex pattern validation
- Remove unnecessary str() wrapper since type is already validated
- Consolidate empty string test into invalid names test
- Merge None and wrong type tests into single test method with multiple cases
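A quick illustration of that last point, assuming the regex pattern from the sketch earlier: empty and whitespace-only names fall out of the same check that rejects other invalid characters.

```python
import re

NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")  # assumed pattern, per the description above

for candidate in ["", "   ", "ok_name_1", "bad-name"]:
    print(repr(candidate), bool(NAME_PATTERN.fullmatch(candidate)))
# '' -> False, '   ' -> False, 'ok_name_1' -> True, 'bad-name' -> False
```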
Force-pushed from 787bc26 to f9df920.