diff --git a/_questions/data-engineering-zoomcamp/module-6/062_2b59e8e6c1_spark-bigquery-read-write.md b/_questions/data-engineering-zoomcamp/module-6/062_2b59e8e6c1_spark-bigquery-read-write.md
new file mode 100644
index 0000000..2884c2e
--- /dev/null
+++ b/_questions/data-engineering-zoomcamp/module-6/062_2b59e8e6c1_spark-bigquery-read-write.md
@@ -0,0 +1,34 @@
+---
+id: 2b59e8e6c1
+question: How do I use Spark with BigQuery as a data source and sink?
+sort_order: 62
+---
+
+Add the connector package (pin a version that matches your Spark and Scala build): `com.google.cloud.spark:spark-bigquery-with-dependencies_2.12`
+
+Read from BigQuery:
+
+```python
+df = spark.read.format("bigquery") \
+    .option("table", "project.dataset.table") \
+    .load()
+```
+
+Write to BigQuery:
+
+```python
+df.write.format("bigquery") \
+    .option("table", "project.dataset.output_table") \
+    .option("writeMethod", "direct") \
+    .mode("overwrite") \
+    .save()
+```
+
+With the default indirect write (no `writeMethod` set), you must also set `temporaryGcsBucket` so the connector can stage data in GCS before loading it into BigQuery.
+
+Make sure:
+- your GCP credentials are configured
+- the dataset location matches your query location
+- the output dataset exists
+
+This enables distributed processing on top of warehouse data.
\ No newline at end of file