Photo by Call Me Fred on Unsplash
This project is my final submission for the Data Engineering Zoomcamp (cohort 2025). I built a batch data pipeline that automatically updates daily, leveraging modern cloud and containerization technologies. The data used in this project is provided by the City of Zurich with an open source Creative Commons License (cc-zero). For a more detailed explanation of the whole project, check out my Medium Article.
Air pollution is a serious public health issue in many cities around the world. I discovered years ago that my city publishes all of its measurements of air pollutants, but I didn't know how to make sense of this data. I wanted to create a visualization that gives a clear indication of how serious a certain level of pollution is, for each independent pollutant, and which also explains what this pollutant does.
Therefore, I generated a dataset that contains the pollution measurements, but also emission limits and descriptions for each pollutant. For the visualization I then chose to calculate how many days in a year the pollution was higher than it should have been. This gives an intuitive understanding of how damaging the situation is.
The project utilizes the following technologies:
- Docker & Docker Compose: For containerization and orchestration (Docker, Docker Compose)
- Google Cloud Virtual Machine (VM) with Scheduling: To run automated data workflows (Google Compute Engine)
- Kestra: Orchestrating workflows in a scalable manner (Kestra)
- Google Cloud Storage: Storing raw data files (Google Cloud Storage)
- BigQuery: Cloud-based data warehouse for storing and querying data (BigQuery)
- Dataform: Automated data transformations (modeling) in BigQuery (Dataform)
- Looker Studio: Creating visualizations and reports (Looker Studio)
- GitHub: Version control and project management
Below is an illustration of the different steps in the data pipeline and the tools used to complete each one:
In the following sections, there's detailed information on how each of these steps was implemented.
The raw data files are provided by the City of Zurich in their Open Data Catalog. To extract these automatically, I set up a Docker container with a Kestra image using this docker-compose file and hosted it in a virtual machine in the Google Cloud. I put the VM on a schedule to optimize the cost and used the following start-up script to start up the container inside the VM:
#! /bin/bash
apt update
docker compose -f /home/angelakniederberger/docker-compose.yml startIn Kestra, I scheduled three flows: to set the key-value pairs for Google Cloud authentication and extract (1) the air pollution metrics and (2) the metadata on air pollutants, emission limits and data collection locations. After downloading the data from the Open Data Catalog, it gets saved to Google Cloud Storage and from there loaded into BigQuery tables.
For the air quality metrics, which are available in .csv format (sample data), the data then gets loaded into one BigQuery table incrementally (day by day). The metadata is provided in a nested .json format (sample data), for which I added a python transform into the Kestra flow to unnest and store it in separate .csv files. These are then also loaded into BigQuery tables as snapshot datasets.
Once the data is available in BigQuery, it gets transformed and modeled into tables specifically designed to fit the charts on the dashboard. I decided to use Dataform for this purpose, since it is native to the Google Cloud and integrates seamlessly with BigQuery. The data transformations are written into SQL files, which make up the data model. Below is a screenshot of the data lineage:
The very first nodes on the graph are views, which I then materialized into a joined staging table in the next step. Wherever possible, I partitioned the tables by date and clustered them by location, because these dimensions are frequently used in WHERE or GROUP BY clauses.
All the nodes on the graph above are available for querying as tables or views in BigQuery. This makes it easier to fulfill ad hoc reporting requirements. You can check all the datasets here:
- air_quality_data: tables as ingested by Kestra in production
- air_quality_data_dev: tables as ingested by Kestra in development
- dataform: views and tables generated by the dataform model in production
- dataform_transforms_dev: views and tables generated by the dataform model in development
Finally, I connected LookerStudio to BigQuery via Connected Sheets. I introduced this additional step to limit the query costs to a minimum. In the sheet, I scheduled a data extract from BigQuery once per day (which gets billed), but then the connection between the sheet and LookerStudio is free, so it doesn't matter how many people use the dashboard or how often the filters are adjusted. Below is a screenshot of the dashboard.
The dashboard can be accessed here.
Both Kestra and Dataform are connected to this git repository. This allows for a clear distinction between development and production code. In Kestra, I set up flows that let me sync the files from GitHub, while on Dataform, I scheduled a fresh compilation of the production code once each day.
Ensure you have the following:
- Docker & Docker Compose
- Google Cloud Platform account with billing enabled
- GitHub account
- Clone the repository:
- Download all the project files by running:
git clone https://github.com/Alessine/zurich_air_quality.git
- Download all the project files by running:
- Setup Google Cloud VM:
- Create a VM instance on Google Cloud.
- Configure a scheduling job with a startup script to run the pipelines periodically.
- Run Kestra for Workflow Management:
- Add your GitHub and GCP credentials and run the docker-compose.yml file to deploy Kestra using Docker Compose:
docker-compose up -d
- Configure workflows to automate data ingestion and transformations with the help of these flow files.
- Add your GitHub and GCP credentials and run the docker-compose.yml file to deploy Kestra using Docker Compose:
- Set up Dataform for Data Modeling:
- Create a repository and set up a development workspace.
- Use the workflow settings and definitions provided to set up the data model.
- Push the code to production, compile it and execute it to generate the BigQuery tables.
- Build Dashboards with Looker Studio:
- Connect the required BigQuery datasets with a connected sheet and set up an automatic refresh.
- Connect LookerStudio with the sheet to create an interactive dashboard.
├── data/ # Sample data files
├── definitions/ # Dataform SQL transformations
├── flows/ # Kestra workflows
├── img/ # Screenshots etc.
├── .gitignore # Files to be ignored by git
├── LICENSE # MIT Open Source License
├── README.md # Project readme file
├── docker-compose.yml # Docker Compose configuration for Kestra
└── workflow_settings.yaml # Dataform configurations
- Implement CI/CD pipelines for automated deployments.
- Enhance data quality checks and monitoring.
- Scale workflows for higher data volumes.
For any questions or contributions, feel free to connect via GitHub or LinkedIn.



