chore: Run Spark 4.0 SQL tests with native_datafusion scan #3728
Draft
andygrove wants to merge 4 commits into apache:main from
Conversation
Add native_datafusion scan-impl matrix entry for Spark 4.0 in spark_sql_test.yml and update 4.0.1.diff to ignore tests that fail with native_datafusion scan (same tests as Spark 3.4).
…1 diff Add missing CometNativeScanExec pattern matches to SchemaPruningSuite and FileBasedDataSourceSuite, fixing all 183 ParquetV1SchemaPruningSuite failures. Tag remaining incompatible tests with IgnoreCometNativeDataFusion.
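The two kinds of edits these commits make to 4.0.1.diff can be sketched as patch hunks like the following. This is a hedged illustration only: the file paths, context lines, and the exact tag-attachment syntax are assumptions, not the actual contents of the diff; only the identifiers (`CometNativeScanExec`, `FileSourceScanExec`, `CometScanExec`, `IgnoreCometNativeDataFusion`) come from this PR.

```diff
# Hypothetical sketch of a pattern-match hunk: teach the schema-pruning helper
# to also recognize the native scan node (paths/context are illustrative).
--- a/sql/core/src/test/scala/.../SchemaPruningSuite.scala
+++ b/sql/core/src/test/scala/.../SchemaPruningSuite.scala
@@ ... @@
     val scanSchemata = collect(df.queryExecution.executedPlan) {
       case scan: FileSourceScanExec => scan.requiredSchema
       case scan: CometScanExec => scan.requiredSchema
+      case scan: CometNativeScanExec => scan.requiredSchema
     }

# Hypothetical sketch of a test-tagging hunk: mark a known-incompatible test
# so it is skipped under the native_datafusion scan (tag syntax is assumed).
@@ ... @@
-  test("static scan metrics") {
+  test("static scan metrics",
+      IgnoreCometNativeDataFusion("fails with native_datafusion scan")) {
```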
Which issue does this PR close?
Follow-up to #3694 which enabled this for Spark 3.5, and the companion PR for Spark 3.4.
Rationale for this change
We already run the Spark SQL tests with the `native_datafusion` scan implementation for Spark 3.4 and 3.5, but not for Spark 4.0. This PR adds Spark 4.0 to the CI matrix for `native_datafusion` scan testing.

What changes are included in this PR?

- Add a `native_datafusion` scan-impl matrix entry for Spark 4.0 in `spark_sql_test.yml`
- Add a `sql_hive-1` exclusion for the new `native_datafusion` Spark 4.0 config (same exclusion as the `auto` config)
- Update `4.0.1.diff` to add missing `CometNativeScanExec` pattern matches (ported from the 3.5.8 diff):
  - `SchemaPruningSuite.checkScanSchemata`: add a `CometNativeScanExec` case to fix all 183 `ParquetV1SchemaPruningSuite` failures (the helper only matched `FileSourceScanExec` and `CometScanExec`, missing the native scan node)
  - `FileBasedDataSourceSuite`: add `CometNativeScanExec` to the import and the `dataFilters` pattern match
- Update `4.0.1.diff` to annotate tests with `IgnoreCometNativeDataFusion` that are known to fail with the `native_datafusion` scan:
  - `DynamicPartitionPruningSuite`: "static scan metrics", "join key with multiple references on the filtering plan"
  - `FileBasedDataSourceSuite`: "Spark native readers should respect spark.sql.caseSensitive", "Enabling/disabling ignoreMissingFiles using parquet", "SPARK-41017: filter pushdown with nondeterministic predicates"
  - `ParquetFilterSuite`: "SPARK-25207: exception when duplicate fields in case-insensitive mode"
  - `ParquetIOSuite`: "SPARK-35640: read binary as timestamp should throw schema incompatible error"
  - `ParquetQuerySuite`: "SPARK-47447: read TimestampLTZ as TimestampNTZ", "SPARK-34212 Parquet should read decimals correctly", "row group skipping doesn't overflow when reading into larger type", "Enabling/disabling ignoreCorruptFiles"
  - `ParquetSchemaSuite`: "schema mismatch failure error message for parquet vectorized reader", "SPARK-45604: schema mismatch failure error on timestamp_ntz to array<timestamp_ntz>"
  - `ParquetTypeWideningSuite`: all "unsupported parquet conversion", "unsupported parquet timestamp conversion", "parquet decimal precision change", "parquet decimal precision and scale change", and "parquet widening conversion DateType -> TimestampNTZType" tests
  - `SubquerySuite`: "Subquery reuse across the whole plan", "SPARK-43402: FileSourceScanExec supports push down data filter with scalar subquery"
  - `SQLViewSuite`: "alter temporary view should follow current storeAnalyzedPlanForView config" (covers both `SimpleSQLViewSuite` and `HiveSQLViewSuite`)
  - `ExtractPythonUDFsSuite`: "Python UDF should not break column pruning/filter pushdown -- Parquet V1"

How are these changes tested?
By running the Spark SQL tests in CI with the new `native_datafusion` configuration for Spark 4.0.
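The CI matrix change described above might look roughly like the sketch below. The key names, job name, and the pre-existing 3.4/3.5 entries are assumptions for illustration; only the idea of adding a Spark 4.0 + `native_datafusion` combination to `spark_sql_test.yml` comes from this PR.

```yaml
# Hypothetical sketch of the scan-impl matrix in spark_sql_test.yml.
# Structure and key names are illustrative, not the actual workflow.
jobs:
  spark-sql:
    strategy:
      matrix:
        include:
          - spark-version: "3.4"
            scan-impl: native_datafusion
          - spark-version: "3.5"
            scan-impl: native_datafusion
          - spark-version: "4.0"   # entry added by this PR
            scan-impl: native_datafusion
```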