This project implements an end-to-end AWS data engineering pipeline to ingest, transform, and publish analytics-ready YouTube trending datasets using a multi-layer data lake design (Raw → Cleansed → Conformed).
- Amazon S3 (Raw/Cleansed layers)
- AWS Lambda (S3-triggered JSON processing)
- AWS Glue (PySpark ETL)
- AWS Glue Data Catalog
- Amazon Athena / Amazon Redshift (analytics)
- Upload raw CSV files to S3 using Hive-style partitions (`region=us/`, `region=ca/`, etc.).
- A Lambda function triggers on JSON uploads, normalizes the nested JSON, converts it to Parquet, writes the output to the cleansed S3 layer, and updates the Glue Data Catalog (see the Lambda sketch after this list).
- An AWS Glue PySpark job reads the raw table from the Glue Data Catalog, applies column mapping and cleaning, partitions the output by region, and writes Parquet to the cleansed S3 layer (see the Glue sketch after this list).
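The Lambda step drives the event-driven ingestion. Below is a minimal sketch of what such a handler can look like with awswrangler + pandas; the environment variable names, the `items` field being normalized, and the write mode are assumptions based on the description above, not the exact contents of `src/lambda/`.

```python
import os
import urllib.parse

import awswrangler as wr
import pandas as pd

# Assumed environment variable names; see sample-config/lambda_env_example.json
S3_CLEANSED_LAYER = os.environ["s3_cleansed_layer"]
GLUE_DB = os.environ["glue_catalog_db_name"]
GLUE_TABLE = os.environ["glue_catalog_table_name"]
WRITE_MODE = os.environ["write_data_operation"]  # e.g. "append"


def lambda_handler(event, context):
    # S3 put event -> bucket/key of the JSON file that was just uploaded
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"], encoding="utf-8")

    # Read the raw JSON and flatten the nested "items" array (assumed field name)
    df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
    df_flat = pd.json_normalize(df_raw["items"])

    # Write Parquet to the cleansed layer and register/update the Glue Catalog table
    return wr.s3.to_parquet(
        df=df_flat,
        path=S3_CLEANSED_LAYER,
        dataset=True,
        database=GLUE_DB,
        table=GLUE_TABLE,
        mode=WRITE_MODE,
    )
```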
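The Glue job follows the standard Glue PySpark skeleton: read from the Catalog (with a predicate pushed down to the `region` partition column), apply a column mapping, then write partitioned Parquet. The database, table, bucket, and column names below are illustrative placeholders, not the actual values in `src/glue/`.

```python
import sys

from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw table registered by the crawler; only scan selected partitions
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_raw",            # placeholder database name
    table_name="raw_statistics",          # placeholder table name
    transformation_ctx="datasource",
    push_down_predicate="region in ('us','ca','gb')",
)

# Illustrative subset of the column mapping (source name/type -> target name/type)
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("video_id", "string", "video_id", "string"),
        ("category_id", "long", "category_id", "bigint"),
        ("views", "long", "views", "bigint"),
    ],
    transformation_ctx="mapped",
)

# Basic cleaning, then write Parquet partitioned by region to the cleansed layer
df_clean = mapped.toDF().dropna()
final_dyf = DynamicFrame.fromDF(df_clean, glueContext, "final_dyf")

glueContext.write_dynamic_frame.from_options(
    frame=final_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://your-cleansed-bucket/youtube/raw_statistics/",  # placeholder
        "partitionKeys": ["region"],
    },
    format="parquet",
    transformation_ctx="sink",
)

job.commit()
```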
- `src/lambda/` → Lambda function (awswrangler + pandas)
- `src/glue/` → AWS Glue PySpark ETL script
- `src/cli/` → AWS CLI upload script
- `architecture/` → architecture diagram
- Upload raw data to S3 (`src/cli/s3_upload_commands.sh`; a Python equivalent is sketched below)
- Configure Lambda environment variables (see `sample-config/lambda_env_example.json`)
- Create Glue Crawlers / Catalog tables
- Run the Glue ETL job and query the results with Athena or Redshift (example query below)
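The upload script uses the AWS CLI, but the same Hive-style layout can be produced with boto3. This is a hypothetical equivalent; the bucket name, key prefix, and local file paths are placeholders.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "your-raw-bucket"  # placeholder bucket name

# Each region's CSV lands under its own Hive-style partition prefix,
# so the Glue crawler picks up `region` as a partition column.
for region in ["us", "ca", "gb"]:
    s3.upload_file(
        Filename=f"data/{region.upper()}videos.csv",  # placeholder local path
        Bucket=BUCKET,
        Key=f"youtube/raw_statistics/region={region}/{region.upper()}videos.csv",
    )
```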
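Once the cleansed table is registered in the Catalog, it can be queried from Athena, for example through awswrangler. The database, table, and column names and the partition filter below are illustrative; filtering on the `region` partition column lets Athena prune partitions instead of scanning the whole dataset.

```python
import awswrangler as wr

# Top trending videos for a single region; the WHERE clause on the
# partition column limits the scan to that partition (predicate pushdown).
df = wr.athena.read_sql_query(
    sql="""
        SELECT title, views
        FROM raw_statistics
        WHERE region = 'us'
        ORDER BY views DESC
        LIMIT 10
    """,
    database="db_youtube_cleansed",  # placeholder database name
)
print(df.head())
```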
- Event-driven ingestion (S3 → Lambda)
- JSON normalization + Parquet conversion
- Glue PySpark transformations + schema enforcement
- Partitioning + predicate pushdown optimization
