From 9e01aff4e08f80d3ad99dc4f5ad41eac46c17c5b Mon Sep 17 00:00:00 2001
From: FAQ Bot
Date: Sat, 7 Mar 2026 00:00:02 +0000
Subject: [PATCH] NEW: Why does Spark write multiple parquet files after
 repartitioning a DataFrame?

---
 ...aa_spark-write-multiple-parquet-per-partition.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)
 create mode 100644 _questions/data-engineering-zoomcamp/module-6/060_a71d2105aa_spark-write-multiple-parquet-per-partition.md

diff --git a/_questions/data-engineering-zoomcamp/module-6/060_a71d2105aa_spark-write-multiple-parquet-per-partition.md b/_questions/data-engineering-zoomcamp/module-6/060_a71d2105aa_spark-write-multiple-parquet-per-partition.md
new file mode 100644
index 0000000..3c57014
--- /dev/null
+++ b/_questions/data-engineering-zoomcamp/module-6/060_a71d2105aa_spark-write-multiple-parquet-per-partition.md
@@ -0,0 +1,13 @@
+---
+id: a71d2105aa
+question: Why does Spark write multiple parquet files after repartitioning a DataFrame?
+sort_order: 60
+---
+
+Spark processes data in partitions. When you write a DataFrame to disk, Spark writes each partition as a separate output file. For example:
+
+```python
+trips.repartition(4).write.parquet("output/")
+```
+
+This creates four parquet files because the DataFrame now has four partitions. This behavior enables Spark to write data in parallel and can improve performance on large datasets.
\ No newline at end of file