diff --git a/_questions/data-engineering-zoomcamp/module-6/060_a71d2105aa_spark-write-multiple-parquet-per-partition.md b/_questions/data-engineering-zoomcamp/module-6/060_a71d2105aa_spark-write-multiple-parquet-per-partition.md
new file mode 100644
index 0000000..3c57014
--- /dev/null
+++ b/_questions/data-engineering-zoomcamp/module-6/060_a71d2105aa_spark-write-multiple-parquet-per-partition.md
@@ -0,0 +1,13 @@
+---
+id: a71d2105aa
+question: Why does Spark write multiple parquet files after repartitioning a DataFrame?
+sort_order: 60
+---
+
+Spark processes data in partitions. When you write a DataFrame to disk, Spark writes each partition as a separate output file. For example:
+
+```python
+trips.repartition(4).write.parquet("output/")
+```
+
+This creates four parquet files because the DataFrame now has four partitions. This behavior enables Spark to write data in parallel and can improve performance on large datasets.
\ No newline at end of file