This project implements a robust, cloud-native ETL pipeline that extracts data from three NASA APIs:
- 🪐 Astronomy Picture of the Day (APOD)
- ☄️ Near Earth Object Web Service (NeoWs)
- 🚜 Mars Rover Photos
The pipeline is orchestrated with AWS Glue and enriched through a microservice hosted on AWS Lightsail, which performs tasks such as image metadata extraction, image classification (PyTorch), and asteroid threat scoring. Infrastructure is provisioned with Terraform, and the final datasets are stored in an AWS RDS PostgreSQL instance.
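As a rough illustration of the extract step, the sketch below builds request URLs for the three APIs. The function names are hypothetical and not taken from this repository, and the endpoint paths are assumed from the public NASA API documentation, so verify them against api.nasa.gov before relying on them.

```python
from urllib.parse import urlencode

NASA_API_BASE = "https://api.nasa.gov"


def apod_url(api_key: str, date: str) -> str:
    """URL for the APOD entry of a single day (date as YYYY-MM-DD)."""
    query = urlencode({"api_key": api_key, "date": date})
    return f"{NASA_API_BASE}/planetary/apod?{query}"


def neows_feed_url(api_key: str, start_date: str, end_date: str) -> str:
    """URL for the NeoWs feed of close approaches between two dates."""
    query = urlencode(
        {"start_date": start_date, "end_date": end_date, "api_key": api_key}
    )
    return f"{NASA_API_BASE}/neo/rest/v1/feed?{query}"


def mars_photos_url(api_key: str, rover: str, earth_date: str) -> str:
    """URL for the photos a given rover took on a given Earth date."""
    query = urlencode({"earth_date": earth_date, "api_key": api_key})
    return f"{NASA_API_BASE}/mars-photos/api/v1/rovers/{rover}/photos?{query}"


if __name__ == "__main__":
    # DEMO_KEY is NASA's public, rate-limited demo key.
    print(apod_url("DEMO_KEY", "2024-01-01"))
```

Keeping URL construction separate from the HTTP call makes the extract logic easy to test without network access.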
It follows the Medallion Architecture:
- Bronze: Raw API data (Extract)
- Silver: Cleaned and normalized data (Transform)
- Gold: Enriched, query-ready data (Enrich)
ETL Architecture diagram.
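The three layers can be pictured as successive functions over a record. This is a minimal sketch for a single APOD record, not the project's actual code: `to_bronze`, `to_silver`, `to_gold`, and `fake_enrich` are hypothetical names, and the enrichment step is a stand-in that returns a fixed label.

```python
from datetime import date


def to_bronze(raw: dict) -> dict:
    """Bronze: keep the API payload untouched, tagged with its source."""
    return {"source": "apod", "payload": raw}


def to_silver(bronze: dict) -> dict:
    """Silver: select the fields of interest, trim strings, parse the date."""
    payload = bronze["payload"]
    return {
        "date": date.fromisoformat(payload["date"]),
        "title": payload["title"].strip(),
        "media_type": payload.get("media_type", "unknown"),
        "url": payload.get("url"),
    }


def to_gold(silver: dict, enrich) -> dict:
    """Gold: merge in whatever the enrichment step returns."""
    return {**silver, **enrich(silver)}


def fake_enrich(record: dict) -> dict:
    # Stand-in for the call to the enrichment service (assumption).
    return {"label": "galaxy" if record["media_type"] == "image" else None}


raw = {
    "date": "2024-01-01",
    "title": "  NGC 1232  ",
    "media_type": "image",
    "url": "https://example.com/a.jpg",
}
gold = to_gold(to_silver(to_bronze(raw)), fake_enrich)
```

Because each layer only reads the output of the previous one, a failed enrichment can be rerun from silver without calling the NASA API again.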
The pipeline is modular, extensible, and designed with traceability, reliability, and automation in mind. It includes workflow orchestration with job dependencies and retry logic for failure resilience.
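How the project configures its retries is not shown in this README. The sketch below only illustrates the general idea of retrying a failing step with exponential backoff; `run_with_retries` and its parameters are hypothetical names.

```python
import time


def run_with_retries(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call job() until it succeeds or max_attempts is reached.

    Waits base_delay * 2 ** (attempt - 1) seconds between attempts.
    The sleep function is injectable so the logic can be tested quickly.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a job that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays = []
result = run_with_retries(flaky_job, sleep=delays.append)
```

In a managed setup the orchestrator usually handles retries per job, so a helper like this is mainly useful for individual calls inside a job, such as a single API request.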
⚠️ Most services run on the AWS Free Tier, but a few (e.g., Lightsail) might incur small costs. Don’t forget to tear down the infrastructure with `terraform destroy` when done.
The final enriched datasets stored in PostgreSQL can be visualized with Amazon QuickSight. The dashboard shows metrics such as asteroid threat levels, Mars rover activity by date and camera, and APOD image trends in an interactive format.
Example of the QuickSight Dashboard built on top of the gold-layer data.
As a bonus, I created a simple web application that lets users generate mosaics from the space images.
Example of an image generated with the Mosaic Generator app.
- Add your secrets to `infra/secrets.auto.tfvars`. See `infra/secrets.auto.tfvars.example` for guidance.
nasa_api_key = "YOUR-KEY-HERE"
enrichment_service_api_key = "YOUR-KEY-HERE"
aws_access_key_id = "YOUR-KEY-HERE"
aws_secret_access_key = "YOUR-KEY-HERE"
db_password = "YOUR-RDS-PASSWORD-HERE"
alert_email = "YOUR-ALERT-EMAIL-HERE"
- Deploy the infrastructure:
cd infra
terraform init
terraform apply
After provisioning, deploy the containerized microservice on AWS Lightsail:
- Create the environment file `.env` and config file `lc.json` in `enrichment-service/`. Use the `.example` templates provided.
.env
API_KEY=YOUR-KEY-HERE
DB_NAME=nasa
DB_USER=postgres
DB_PASSWORD=YOUR-RDS-DB-PASSWORD
DB_HOST=YOUR-RDS-DB-HOST
DB_PORT=5432
- Build and deploy using:
enrichment-service/build-and-deploy.bat
- Once the infrastructure and enrichment service are up and running, you’ll need to initialize the PostgreSQL database with the necessary schemas, views, and materialized views. To do that, simply run the setup script:
python etl/db/create_tables.py
This script will:
- Create structured tables for APOD, NeoWs, and Mars Rover datasets
- Define helper views for unified queries
- Register materialized views to optimize performance for dashboarding
Make sure your .env file in enrichment-service/ is correctly set up with your RDS credentials before running the script.
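A quick way to check the file before running the script is to parse it and confirm every database setting is present. This is a hedged sketch, not the logic used by `create_tables.py`: `parse_env` and `db_settings` are hypothetical helpers, and the returned keys are chosen to match the standard PostgreSQL connection parameter names.

```python
REQUIRED = ("DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT")


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def db_settings(env: dict) -> dict:
    """Validate the DB_* entries and map them to connection parameters."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(f"missing settings: {', '.join(missing)}")
    return {
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),
    }


# Usage with an inline sample instead of a real file.
sample = """
# database
DB_NAME=nasa
DB_USER=postgres
DB_PASSWORD=example-password
DB_HOST=localhost
DB_PORT=5432
"""
settings = db_settings(parse_env(sample))
```

Failing early with a list of missing keys is easier to debug than a connection error raised halfway through table creation.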
- Automated Extraction & Loading using AWS Glue workflows and triggers
- Data Transformation with PySpark (flattening, normalization, deduplication)
- Image Metadata Extraction and Classification via FastAPI microservice + PyTorch (Lightsail)
- PostgreSQL-Compatible Schemas for APOD, NEO, and Mars Rover datasets
- Job Monitoring via CloudWatch Logs and optional SNS notifications
- IaC using Terraform with modular structure for scalability
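The transformation steps listed above (flattening, normalization, deduplication) run in PySpark in this project. The plain-Python analogue below only shows the idea on a NeoWs feed response. The function names are hypothetical, the field names are assumed from the public NeoWs response format, and the miss distance is assumed to arrive as a string.

```python
def flatten_neo_feed(feed: dict) -> list[dict]:
    """Flatten a feed into one row per (asteroid, close approach)."""
    rows = []
    for neos in feed.get("near_earth_objects", {}).values():
        for neo in neos:
            for approach in neo.get("close_approach_data", []):
                rows.append({
                    "neo_id": neo["id"],
                    "name": neo["name"].strip(),
                    "hazardous": bool(neo["is_potentially_hazardous_asteroid"]),
                    "approach_date": approach["close_approach_date"],
                    # Normalize the numeric string to a float.
                    "miss_distance_km": float(approach["miss_distance"]["kilometers"]),
                })
    return rows


def deduplicate(rows, key=("neo_id", "approach_date")):
    """Keep the first row seen for each combination of key columns."""
    seen, unique = set(), []
    for row in rows:
        marker = tuple(row[column] for column in key)
        if marker not in seen:
            seen.add(marker)
            unique.append(row)
    return unique


# Usage: the same approach appears under two dates and is kept once.
neo = {
    "id": "1",
    "name": " (2024 AB) ",
    "is_potentially_hazardous_asteroid": False,
    "close_approach_data": [
        {"close_approach_date": "2024-01-01",
         "miss_distance": {"kilometers": "1500000.5"}},
    ],
}
feed = {"near_earth_objects": {"2024-01-01": [neo], "2024-01-02": [neo]}}
rows = deduplicate(flatten_neo_feed(feed))
```

In PySpark the same steps would typically be expressed with column selection on nested fields and a drop-duplicates call on the key columns.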
.
├── infra/ # Terraform code (S3, RDS, Glue, IAM, etc.)
├── enrichment-service/ # FastAPI microservice for image metadata enrichment and Mosaic Generator app
├── etl/
│ ├── extract/ # Raw data extraction logic
│ ├── transform/ # Data cleaning & normalization
│ ├── enrich/ # Data enrichment through the Lightsail service
│ ├── db/ # Database definition for PostgreSQL
│ └── load/ # Load to RDS via JDBC
├── docs/ # Diagrams and reports
└── README.md

