-
Notifications
You must be signed in to change notification settings - Fork 29.1k
[SPARK-55260][Geo][SQL] Implement Parquet write support for Geo types #54039
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
JIRA Issue Information=== Sub-task SPARK-55260 === This comment was automatically generated by GitHub Actions |
uros-db
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some failures which don't look related to my changes (subquery/in-subquery/in-set-operations.sql and OracleJoinPushdownIntegrationSuite - Docker integration), so @cloud-fan please review.
|
There are some failures again, but I don't think any are related to these changes. @cloud-fan |
.../src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala
Outdated
Show resolved
Hide resolved
...re/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
Show resolved
Hide resolved
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few nit comments, but looks good
|
Linter is failing with |
|
|
||
| override def supportsDataType(dataType: DataType): Boolean = dataType match { | ||
| // GeoSpatial data types in Parquet are limited only to types with supported SRIDs. | ||
| case g: GeometryType => GeometryType.isSridSupported(g.srid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a dumb question: GeometryType.isSridSupported sounds like not for parquet, but for spark itself. So this is a safe guard that if the input data contains geo values that not supported by Spark (not sure how it can happen), we don't write them to parquet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a safe guard for Spark, but the same code is used both for reads and writes, please see the comment in the base class (sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala):
/**
* Returns whether this format supports the given [[DataType]] in read/write path.
* By default all data types are supported.
*/
def supportsDataType(dataType: DataType): Boolean = true
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala
Line 120 in 3fa07bb
| * Returns whether this format supports the given [[DataType]] in read/write path. |
|
the linter failure is unrelated, thanks, merging to master! |
What changes were proposed in this pull request?
Enable writing Geometry and Geography data to Parquet files.
Why are the changes needed?
Allowing users to persist geospatial data in Parquet format.
Does this PR introduce any user-facing change?
Yes, geo data can now be written to Parquet.
How was this patch tested?
Added tests for writing GEOMETRY and GEOGRAPHY to Parquet.
Was this patch authored or co-authored using generative AI tooling?
No.