Experimental: support parquet partition write #4670

Draft
kazantsev-maksim wants to merge 67 commits into apache:main from kazantsev-maksim:support_parquet_partition_write

@kazantsev-maksim kazantsev-maksim commented Jun 17, 2026

Contributor

Which issue does this PR close?

Part of: #1625, #4547

Rationale for this change

The native Parquet writer previously supported only local filesystem and HDFS output, and explicitly bailed out (Unsupported) on any query with partition columns. As a result, the very common INSERT ... PARTITIONED BY (...) pattern, and any write to an S3-backed table, fell back to Spark's row-based writer.

What changes are included in this PR?

Native (Rust): new partition_writer.rs with PartitionedWriter, which:
- splits each RecordBatch by partition key and routes rows to a per-partition writer keyed by sub-directory (e.g. a=1/b=2);
- escapes partition values to exactly mirror Spark/Hive ExternalCatalogUtils.escapePathName (control chars + a fixed special-char set percent-encoded as upper-case %XX);
- maps null/empty partition values to __HIVE_DEFAULT_PARTITION__ (Spark's DEFAULT_PARTITION_NAME);
- lazily creates writers on first use and closes all open writers concurrently, returning the real part-file paths.
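
For reviewers, the escaping, default-partition, and row-routing rules above can be sketched in plain Rust roughly as follows. This is a minimal illustration, not the code in partition_writer.rs: the function names (`needs_escape`, `escape_path_name`, `partition_dir`, `split_by_partition`) are hypothetical, rows are modelled as vectors of optional strings rather than Arrow arrays, and the special-character set is written from memory of Spark's non-Windows list and should be checked against ExternalCatalogUtils before relying on it.

```rust
use std::collections::BTreeMap;

/// Spark's DEFAULT_PARTITION_NAME, used for null or empty partition values.
const DEFAULT_PARTITION_NAME: &str = "__HIVE_DEFAULT_PARTITION__";

/// Returns true for characters that get percent-encoded.
/// Assumption: this mirrors Spark's non-Windows list (control chars 0x01-0x1F,
/// DEL, and a fixed set of special characters); verify against Spark's source.
fn needs_escape(c: char) -> bool {
    matches!(c, '\u{01}'..='\u{1F}' | '\u{7F}')
        || matches!(
            c,
            '"' | '#' | '%' | '\'' | '*' | '/' | ':' | '=' | '?' | '\\' | '{' | '[' | ']' | '^'
        )
}

/// Percent-encodes escaped characters as upper-case %XX; others pass through.
fn escape_path_name(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        if needs_escape(c) {
            out.push_str(&format!("%{:02X}", c as u32));
        } else {
            out.push(c);
        }
    }
    out
}

/// Builds the sub-directory for one row, e.g. "a=1/b=2".
/// Null (None) and empty values map to DEFAULT_PARTITION_NAME.
fn partition_dir(columns: &[&str], values: &[Option<&str>]) -> String {
    columns
        .iter()
        .zip(values)
        .map(|(name, value)| {
            let v = match value {
                Some(v) if !v.is_empty() => escape_path_name(v),
                _ => DEFAULT_PARTITION_NAME.to_string(),
            };
            format!("{}={}", escape_path_name(name), v)
        })
        .collect::<Vec<_>>()
        .join("/")
}

/// Groups row indices by partition sub-directory, so each group can be
/// handed to its own (lazily created) writer.
fn split_by_partition(
    rows: &[Vec<Option<&str>>],
    columns: &[&str],
) -> BTreeMap<String, Vec<usize>> {
    let mut groups = BTreeMap::new();
    for (i, row) in rows.iter().enumerate() {
        groups
            .entry(partition_dir(columns, row))
            .or_insert_with(Vec::new)
            .push(i);
    }
    groups
}
```

With columns `["a", "b"]`, a row `[Some("1"), None]` would land in `a=1/b=__HIVE_DEFAULT_PARTITION__`, and a value such as `x/y` becomes `x%2Fy`. A real implementation would presumably slice Arrow arrays by the grouped indices instead of materialising strings per row.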
parquet_writer.rs: refactored into a ParquetWriter trait with a StorageWriterFactory that selects the backend by URL scheme — local FS, HDFS (OpendalWriter), and now S3A (OpendalWriter behind s3-opendal). S3 credentials/endpoint/region are read from fs.s3a.* object-store options.
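
As a rough illustration of scheme-based backend selection, the sketch below picks a backend from the output URL and collects S3 settings from an options map. It does not reproduce the actual StorageWriterFactory or ParquetWriter trait: the `Backend` enum, `S3Options` struct, and both function names are hypothetical, treating a scheme-less path as local is an assumption, and the four `fs.s3a.*` keys shown are standard Hadoop S3A keys that may not match the exact set this PR reads.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the backends the factory can choose between.
#[derive(Debug, PartialEq, Eq)]
enum Backend {
    LocalFs,
    Hdfs,
    S3a,
}

/// Selects a backend from the URL scheme.
/// Assumption: a path without "://" is treated as a local filesystem path.
fn select_backend(url: &str) -> Result<Backend, String> {
    match url.split_once("://").map(|(scheme, _)| scheme) {
        None | Some("file") => Ok(Backend::LocalFs),
        Some("hdfs") => Ok(Backend::Hdfs),
        Some("s3a") => Ok(Backend::S3a),
        Some(other) => Err(format!("unsupported output scheme: {other}")),
    }
}

/// Hypothetical holder for S3 settings pulled from object-store options.
#[derive(Debug, Default, PartialEq, Eq)]
struct S3Options {
    access_key: Option<String>,
    secret_key: Option<String>,
    endpoint: Option<String>,
    region: Option<String>,
}

/// Reads S3 settings from fs.s3a.* keys; missing keys stay None.
fn s3_options(opts: &HashMap<String, String>) -> S3Options {
    let get = |key: &str| opts.get(key).cloned();
    S3Options {
        access_key: get("fs.s3a.access.key"),
        secret_key: get("fs.s3a.secret.key"),
        endpoint: get("fs.s3a.endpoint"),
        region: get("fs.s3a.endpoint.region"),
    }
}
```

Returning an error for unknown schemes keeps the fallback decision with the caller, which matches the description's intent that unsupported outputs fall back to Spark's writer rather than fail silently.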

Proto: new repeated string partition_columns = 9 on the ParquetWriter message; planner wires partition_columns through to the native writer.

Spark (Scala): CometDataWritingCommand: supported output filesystems extended to file, hdfs, s3a; partitioned writes are no longer rejected (only static partitions remain unsupported); partition column names are serialized into the proto.

How are these changes tested?

Testing in progress

@kazantsev-maksim kazantsev-maksim marked this pull request as draft June 17, 2026 17:32