This project implements an end-to-end AWS data engineering pipeline to ingest, transform, and publish analytics-ready YouTube trending datasets using a multi-layer data lake design (Raw → Cleansed → Conformed).
- Amazon S3 (Raw/Cleansed layers)
- AWS Lambda (S3-triggered JSON processing)
- AWS Glue (PySpark ETL)
- AWS Glue Data Catalog
- Amazon Athena / Amazon Redshift (analytics)
- Upload raw CSV files to S3 using Hive-style partitions (`region=us/`, `region=ca/`, etc.).
- A Lambda function triggers on JSON uploads, normalizes the nested JSON, converts it to Parquet, writes the output to the cleansed S3 layer, and updates the Glue Data Catalog (see the Lambda sketch after this list).
- An AWS Glue PySpark job reads the raw table from the Glue Data Catalog, applies column mapping and cleaning, partitions the output by region, and writes Parquet to the cleansed S3 layer (see the Glue sketch after this list).
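The Lambda step drives the event-driven ingestion. Below is a minimal sketch of what such a handler can look like with awswrangler + pandas; the environment variable names, the `items` field being normalized, and the write mode are assumptions based on the description above, not the exact contents of `src/lambda/`.

```python
import os
import urllib.parse

import awswrangler as wr
import pandas as pd

# Assumed environment variable names; see sample-config/lambda_env_example.json
S3_CLEANSED_LAYER = os.environ["s3_cleansed_layer"]
GLUE_DB = os.environ["glue_catalog_db_name"]
GLUE_TABLE = os.environ["glue_catalog_table_name"]
WRITE_MODE = os.environ["write_data_operation"]  # e.g. "append"


def lambda_handler(event, context):
    # S3 put event -> bucket/key of the JSON file that was just uploaded
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"], encoding="utf-8")

    # Read the raw JSON and flatten the nested "items" array (assumed field name)
    df_raw = wr.s3.read_json(f"s3://{bucket}/{key}")
    df_flat = pd.json_normalize(df_raw["items"])

    # Write Parquet to the cleansed layer and register/update the Glue Catalog table
    return wr.s3.to_parquet(
        df=df_flat,
        path=S3_CLEANSED_LAYER,
        dataset=True,
        database=GLUE_DB,
        table=GLUE_TABLE,
        mode=WRITE_MODE,
    )
```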
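The Glue job follows the standard Glue PySpark skeleton: read from the Catalog (with a predicate pushed down to the `region` partition column), apply a column mapping, then write partitioned Parquet. The database, table, bucket, and column names below are illustrative placeholders, not the actual values in `src/glue/`.

```python
import sys

from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the raw table registered by the crawler; only scan selected partitions
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="db_youtube_raw",            # placeholder database name
    table_name="raw_statistics",          # placeholder table name
    transformation_ctx="datasource",
    push_down_predicate="region in ('us','ca','gb')",
)

# Illustrative subset of the column mapping (source name/type -> target name/type)
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("video_id", "string", "video_id", "string"),
        ("category_id", "long", "category_id", "bigint"),
        ("views", "long", "views", "bigint"),
    ],
    transformation_ctx="mapped",
)

# Basic cleaning, then write Parquet partitioned by region to the cleansed layer
df_clean = mapped.toDF().dropna()
final_dyf = DynamicFrame.fromDF(df_clean, glueContext, "final_dyf")

glueContext.write_dynamic_frame.from_options(
    frame=final_dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://your-cleansed-bucket/youtube/raw_statistics/",  # placeholder
        "partitionKeys": ["region"],
    },
    format="parquet",
    transformation_ctx="sink",
)

job.commit()
```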
- `src/lambda/` → Lambda function (awswrangler + pandas)
- `src/glue/` → AWS Glue PySpark ETL script
- `src/cli/` → AWS CLI upload script
- `architecture/` → architecture diagram
- Upload raw data to S3 (`src/cli/s3_upload_commands.sh`; a Python equivalent is sketched below)
- Configure Lambda environment variables (see `sample-config/lambda_env_example.json`)
- Create Glue Crawlers / Catalog tables
- Run the Glue ETL job and query the results with Athena or Redshift (example query below)
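The upload script uses the AWS CLI, but the same Hive-style layout can be produced with boto3. This is a hypothetical equivalent; the bucket name, key prefix, and local file paths are placeholders.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "your-raw-bucket"  # placeholder bucket name

# Each region's CSV lands under its own Hive-style partition prefix,
# so the Glue crawler picks up `region` as a partition column.
for region in ["us", "ca", "gb"]:
    s3.upload_file(
        Filename=f"data/{region.upper()}videos.csv",  # placeholder local path
        Bucket=BUCKET,
        Key=f"youtube/raw_statistics/region={region}/{region.upper()}videos.csv",
    )
```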
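Once the cleansed table is registered in the Catalog, it can be queried from Athena, for example through awswrangler. The database, table, and column names and the partition filter below are illustrative; filtering on the `region` partition column lets Athena prune partitions instead of scanning the whole dataset.

```python
import awswrangler as wr

# Top trending videos for a single region; the WHERE clause on the
# partition column limits the scan to that partition (predicate pushdown).
df = wr.athena.read_sql_query(
    sql="""
        SELECT title, views
        FROM raw_statistics
        WHERE region = 'us'
        ORDER BY views DESC
        LIMIT 10
    """,
    database="db_youtube_cleansed",  # placeholder database name
)
print(df.head())
```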
- Event-driven ingestion (S3 → Lambda)
- JSON normalization + Parquet conversion
- Glue PySpark transformations + schema enforcement
- Partitioning + predicate pushdown optimization
