This project builds an AWS data pipeline using S3 + AWS Glue Studio (Visual ETL) + Glue Data Catalog Crawler + Athena + QuickSight.
- Reads the raw CSV datasets from the S3 staging path
- Performs joins (see the PySpark sketch below):
  - Artist ↔ Album (`artist.id = album.artist_id`)
  - Join result ↔ Track (`track.track_id = album.track_id`)
- Drops unnecessary columns
- Writes the curated output as Parquet (Snappy) to the S3 data warehouse path
- Runs a basic data quality rule (`ColumnCount > 0`)
- Makes the data queryable via a Glue crawler and Athena, and visualizable in QuickSight
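A minimal sketch of such a job script, assuming placeholder bucket paths (`s3://my-staging-bucket`, `s3://my-datawarehouse-bucket`), an illustrative drop list, and the Glue Data Quality transform in the form Glue Studio typically generates; the actual script lives in `src/glue_job.py`:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropFields, Join
from awsglue.utils import getResolvedOptions
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

def read_csv(path):
    # Read a headered CSV from S3 into a DynamicFrame
    return glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [path]},
        format="csv",
        format_options={"withHeader": True, "separator": ","},
    )

# Staging paths are placeholders, not the project's real bucket
artists = read_csv("s3://my-staging-bucket/staging/artists.csv")
albums = read_csv("s3://my-staging-bucket/staging/albums.csv")
tracks = read_csv("s3://my-staging-bucket/staging/track.csv")

# Join 1: artist ↔ album; Join 2: join result ↔ track
artist_album = Join.apply(artists, albums, "id", "artist_id")
joined = Join.apply(tracks, artist_album, "track_id", "track_id")

# The single data quality rule from the visual job: ColumnCount > 0
dq_ruleset = """Rules = [ ColumnCount > 0 ]"""
EvaluateDataQuality.apply(
    frame=joined,
    ruleset=dq_ruleset,
    publishing_options={"dataQualityEvaluationContext": "dq_check"},
)

# Drop columns not needed downstream (this list is illustrative)
curated = DropFields.apply(joined, paths=["id", "href", "uri"])

# Write Snappy-compressed Parquet to the data warehouse path (placeholder)
glueContext.write_dynamic_frame.from_options(
    frame=curated,
    connection_type="s3",
    connection_options={"path": "s3://my-datawarehouse-bucket/datawarehouse/"},
    format="parquet",
    format_options={"compression": "snappy"},
)

job.commit()
```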
Built with:
- Amazon S3 (staging + data warehouse)
- AWS Glue Studio (Visual ETL) + PySpark
- AWS Glue Data Catalog + Crawler
- Amazon Athena
- Amazon QuickSight
Input datasets:
- artists.csv
- albums.csv
- track.csv

Output: curated Parquet (Snappy) dataset
Setup:
- Upload the input CSVs to your S3 staging bucket
- Create an AWS Glue job (Glue Studio Visual)
- Attach an IAM role with permissions for S3, Glue, CloudWatch Logs, and Athena
- Run the job
- Run a Glue crawler on the data warehouse S3 path
- Query the curated table in Athena (see the boto3 sketch below)
- Build a dashboard in QuickSight
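The crawler and Athena steps can also be driven from boto3. Here is a minimal sketch; the crawler name, role ARN, database, table, column names, and result bucket are all hypothetical placeholders, not taken from the project:

```python
import time

import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Crawl the curated Parquet path into the Glue Data Catalog
glue.create_crawler(
    Name="spotify-datawarehouse-crawler",
    Role="arn:aws:iam::123456789012:role/GluePipelineRole",
    DatabaseName="spotify_db",
    Targets={"S3Targets": [{"Path": "s3://my-datawarehouse-bucket/datawarehouse/"}]},
)
glue.start_crawler(Name="spotify-datawarehouse-crawler")

# Once the crawler has finished, it creates a table named after the
# S3 folder (assumed "datawarehouse" here); query it from Athena
query = athena.start_query_execution(
    QueryString=(
        "SELECT artist_name, COUNT(*) AS tracks "
        "FROM datawarehouse GROUP BY artist_name LIMIT 10"
    ),
    QueryExecutionContext={"Database": "spotify_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Poll until the query finishes, then print the result rows
qid = query["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```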
Repo files:
- PySpark Glue script: `src/glue_job.py`
- Visual job JSON: `src/glue_visual_job.json`

