This project deploys an ETL pipeline using Apache Airflow and Google BigQuery. It processes current metadata of NYC's Citi Bike stations. The pipeline transforms raw station data into a report on station status and capacity.
- Orchestration: Apache Airflow via GCP Composer
- Data Warehouse: Google BigQuery
- Source Data: bigquery-public-data.new_york_citibike.citibike_stations
- Language: Python (Airflow DAG) and SQL
This pipeline can be run on demand (schedule=None) whenever an updated snapshot of the station metadata is needed. Each execution performs a full overwrite of the destination report tables.
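The full-overwrite behavior can be sketched with a CREATE OR REPLACE TABLE statement, which atomically replaces the destination table on every run. The destination dataset and table names below are illustrative placeholders, not the pipeline's actual identifiers.

```sql
-- Full overwrite on every run: the destination table is replaced, not merged.
-- `reporting.station_report` is a hypothetical destination; substitute your own.
CREATE OR REPLACE TABLE reporting.station_report AS
SELECT
  station_id,
  name,
  region_id,
  capacity
FROM
  `bigquery-public-data.new_york_citibike.citibike_stations`;
```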
The workflow consists of three main stages:
- Extract (Task 1): Reads the raw citibike_stations data, filters out unusable records (retaining only stations with capacity > 0), and selects the columns needed for processing.
- Transform (Task 2): Computes derived metrics, such as station_status (ACTIVE, LIMITED_SERVICE, DECOMMISSIONED) and total_region_capacity, using BigQuery window functions.
- Load (Task 3): Saves the result to the final analysis table, which is clustered by region_id.
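The extract and transform stages above might look like the following BigQuery SQL sketch. The column names come from the public citibike_stations table, but the CASE thresholds used to derive station_status are illustrative assumptions, not the pipeline's exact rules.

```sql
-- Extract: keep only usable stations (capacity > 0), then derive metrics.
-- The status rules below are illustrative assumptions about the pipeline's logic.
SELECT
  station_id,
  name,
  region_id,
  capacity,
  CASE
    WHEN is_installed AND is_renting AND is_returning THEN 'ACTIVE'
    WHEN is_installed THEN 'LIMITED_SERVICE'
    ELSE 'DECOMMISSIONED'
  END AS station_status,
  -- Window function: total capacity across all stations in the same region.
  SUM(capacity) OVER (PARTITION BY region_id) AS total_region_capacity
FROM
  `bigquery-public-data.new_york_citibike.citibike_stations`
WHERE
  capacity > 0;
```

The load stage would write this result into the final table with a CLUSTER BY region_id clause, so queries filtered by region scan less data.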


