
Demo - NYC Citi Bike Station Data Pipeline (Airflow, BigQuery, GCP)

This project deploys an ETL pipeline using Apache Airflow and Google BigQuery. It processes current metadata of NYC's Citi Bike stations. The pipeline transforms raw station data into a report on station status and capacity.

Stack

  • Orchestration: Apache Airflow via GCP Composer
  • Data Warehouse: Google BigQuery
  • Source Data: bigquery-public-data.new_york_citibike.citibike_stations
  • Language: Python (Airflow DAG) and SQL

Pipeline Overview

This pipeline can be run on-demand (schedule=None) whenever an updated snapshot of the station metadata is needed. Each execution performs a full overwrite of the destination report tables.
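The full-overwrite pattern described above can be sketched as a single BigQuery statement that rebuilds the report table on every run (the project/dataset/table names and the selected columns below are illustrative assumptions, not taken from the repo):

```sql
-- Sketch only: destination name is hypothetical; the source table is the
-- public dataset named in the Stack section.
CREATE OR REPLACE TABLE `my_project.citibike.station_report`
CLUSTER BY region_id AS
SELECT
  station_id,
  name,
  region_id,
  capacity,
  -- Window function: total capacity across all stations in the same region.
  SUM(capacity) OVER (PARTITION BY region_id) AS total_region_capacity
FROM `bigquery-public-data.new_york_citibike.citibike_stations`
WHERE capacity > 0;
```

`CREATE OR REPLACE TABLE` gives idempotent on-demand runs, since rerunning the task simply replaces the previous snapshot.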

The workflow consists of three main stages:

  • Stage (Task 1): Reads the raw citibike_stations data, filters out unusable records (e.g., stations with a capacity of 0), and selects only the columns needed for processing.

  • Transform (Task 2): Computes derived metrics, such as station_status (ACTIVE, LIMITED_SERVICE, DECOMMISSIONED) and total_region_capacity, using BigQuery window functions.

  • Load (Task 3): Writes the results to the final analysis table, which is clustered by region_id.
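The status classification in the Transform stage can be sketched in plain Python. The actual pipeline derives this in BigQuery SQL; the field names follow the public citibike_stations schema, but the thresholds here are illustrative assumptions, not the repo's actual logic:

```python
def station_status(is_installed: bool, capacity: int, num_bikes_available: int) -> str:
    """Classify a station using the three statuses named above.

    Thresholds are assumptions for illustration only.
    """
    # A station that is not installed, or reports no docks, is out of service.
    if not is_installed or capacity == 0:
        return "DECOMMISSIONED"
    # Fewer than 10% of docks holding bikes is treated as limited service.
    if num_bikes_available < 0.1 * capacity:
        return "LIMITED_SERVICE"
    return "ACTIVE"
```

In the deployed pipeline this branching would live in a SQL `CASE` expression rather than Python.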

Snapshots of the Deployed Pipeline and Tables

  • DAG Run History (screenshot: Airflow DAG history)

  • DAG Diagram (screenshot: Airflow DAG diagram)

  • Final Station Report Table (screenshot)
