V2 rewrite (beta): Support Spark Connect#254
Open
chenliu0831 wants to merge 10 commits intomasterfrom
Open
Conversation
pydeequ/v2/verification.py
Outdated
| plan = _create_deequ_plan(extension) | ||
|
|
||
| # Use DataFrame.withPlan to properly create the DataFrame | ||
| return ConnectDataFrame.withPlan(plan, session=self._spark) |
There was a problem hiding this comment.
Feel free to ignore!
There is a breaking change between 3.5.x and 4.0.x
In GraphFrames we are using such a code:
def _dataframe_from_plan(plan: LogicalPlan, session: SparkSession) -> DataFrame:
if hasattr(DataFrame, "withPlan"):
# Spark 3
return DataFrame.withPlan(plan, session)
# Spark 4
return DataFrame(plan, session)I would recommend to switch to this approach to avoid the pain during Spark 4.x migration.
Contributor
Author
There was a problem hiding this comment.
Thanks for the callout - addressed in 69a5ed9
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue #, if available:
Description of changes:
This PR introduces PyDeequ 2.0 beta, a major release that replaces the Py4J-based architecture with Spark Connect for client-server communication.
The Deequ side change will be opened separately. the proto file here is copied for review purpose. For ease of testing, I created a pre-release https://github.com/awslabs/python-deequ/releases/tag/v2.0.0b1 to host the jars/wheels.
Motivation
The legacy PyDeequ relied on Py4J to bridge Python and the JVM, which had several limitations:
Spark Connect (introduced in Spark 3.4) provides a clean gRPC-based protocol that solves these issues.
Code Changes
New
pydeequ/v2/module with Spark Connect implementation:checks.py- Check and constraint buildersanalyzers.py- Analyzer classespredicates.py- Serializable predicates (eq,gte,between, etc.)verification.py- VerificationSuite and AnalysisRunnerproto/- Protobuf definitions and generated codeNew test suite in
tests/v2/:test_unit.py- Unit tests (no Spark required)test_analyzers.py- Analyzer integration teststest_checks.py- Check constraint teststest_e2e_spark_connect.py- End-to-end testsUpdated documentation:
API Changes
Testing
More details see https://github.com/awslabs/python-deequ/blob/v2_rewrite/README.md.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.