A production-grade, hybrid data engineering project that simulates Spotify's data infrastructure and processes real-time and batch listening data through a multi-layer Medallion Architecture (Bronze → Silver → Gold). This project demonstrates a full data lifecycle: from event-driven ingestion and warehouse modeling to real-time serving via a sub-second recommendation API.
📖 API Documentation: Detailed API Specs
- Architecture
- Tech Stack
- Data Flow
- Project Phases
- Prerequisites
- Setup & Installation
- Running the Pipeline
- Serving Layer & API
- Dashboard
- Project Structure
- Key Learnings
- Spotify Data Pipeline Status
- Author
The platform operates on two distinct data velocities, unified within Snowflake:
- Path A (Batch): Spotify API → Kafka → S3 → Snowflake → DBT Silver → DBT Gold → Streamlit Dashboard (Orchestrated by Airflow).
The batch pipeline follows a medallion architecture pattern with three distinct layers:
- Bronze Layer: Raw data stored as Parquet files in S3, accessed via Snowflake external tables
- Silver Layer: Cleaned, deduplicated data with quality tests
- Gold Layer: Analytical models optimized for business intelligence and visualization
- Path B (Near Real-Time): Spotify API → Python Poller (Systemd) → Snowflake → FastAPI (Redis Cache).
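Path A lands batches in the Bronze layer as Parquet files partitioned by year/month/day/hour. A minimal sketch of that key layout (the function name and `bronze/plays` prefix are illustrative, not taken from the repo):

```python
from datetime import datetime, timezone

def partition_key(bucket: str, played_at: datetime) -> str:
    """Build an S3 key prefix partitioned by year/month/day/hour
    (layout assumed for illustration, not copied from the repo)."""
    return (
        f"s3://{bucket}/bronze/plays/"
        f"year={played_at:%Y}/month={played_at:%m}/"
        f"day={played_at:%d}/hour={played_at:%H}/"
    )

ts = datetime(2025, 11, 3, 14, 5, tzinfo=timezone.utc)
print(partition_key("spotify-data-lake-sunil-2025", ts))
# s3://spotify-data-lake-sunil-2025/bronze/plays/year=2025/month=11/day=03/hour=14/
```

Hive-style `key=value` prefixes like this are what lets the Snowflake external table prune partitions when querying the Bronze layer.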
| Component | Technology | Version | Purpose |
|---|---|---|---|
| Streaming | Apache Kafka | 3.5.1 (Scala 2.13) | High-throughput event buffering |
| Data Lake | AWS S3 | - | Partitioned Parquet storage (Bronze Layer) |
| Warehouse | Snowflake | 9.38.4 | Unified compute for Silver/Gold modeling |
| Modeling | dbt Core | 1.10.15 | SQL-based transformations & quality testing |
| Orchestration | Apache Airflow | 2.8.1 | Batch workflow scheduling & monitoring |
| Serving API | FastAPI | 0.115.0 | Sub-second recommendation serving |
| Cache | Redis | 7.0 | In-memory storage for API performance |
| Monitoring | Systemd | - | Linux daemon management for real-time services |
| Visualization | Streamlit | 1.40.1 | End-user analytical dashboard |
- Spotify API Integration: Replaced synthetic simulators with real-world data via OAuth 2.0.
- Batch Path: Historical listening events are produced to Kafka, batched into Parquet format, and landed in AWS S3 partitioned by year/month/day/hour.
- Near Real-Time Path: A custom Python Poller daemon checks the Spotify /currently-playing endpoint every 5 seconds, logging state changes directly to Snowflake.
- Bronze Layer: Raw data accessed via Snowflake external tables.
- Silver Layer: Data is deduplicated, type-cast, and validated with strict quality tests.
- Gold Layer: Business-ready models calculate daily user metrics, artist affinity, and track co-occurrence patterns.
- Collaborative Filtering: Recommendations are generated using a weighted scoring model: (Co-occurrence * 0.7) + (Artist Affinity * 0.3).
- Performance: FastAPI serves these models with Redis caching, achieving <100ms response times for cache hits.
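The weighted scoring model above can be sketched in a few lines (candidate data is made up; the weights are the ones stated in this README):

```python
def score_candidate(co_occurrence: float, artist_affinity: float) -> float:
    """Weighted blend from the README: (Co-occurrence * 0.7) + (Artist Affinity * 0.3).
    Co-occurrence dominates; artist affinity acts as a tie-breaker."""
    return co_occurrence * 0.7 + artist_affinity * 0.3

# Rank two hypothetical candidate tracks (signals normalized to [0, 1]):
candidates = {
    "track_a": (0.9, 0.2),   # strong co-occurrence, weak artist affinity
    "track_b": (0.4, 1.0),   # weak co-occurrence, strong artist affinity
}
ranked = sorted(candidates, key=lambda t: score_candidate(*candidates[t]), reverse=True)
print(ranked)  # ['track_a', 'track_b'] — co-occurrence outweighs affinity
```

With a 0.7/0.3 split, a track frequently co-played with the user's history beats one that merely matches favored artists, which is the behavior the pivot to behavioral filtering was after.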
- Built the AWS/Snowflake stack and Kafka streaming layer.
- Implemented dbt Medallion modeling and Airflow orchestration for batch processing.
- Integrated real Spotify Web API for historical backfills and hourly ingestion.
- Pivoted from audio-feature modeling to Behavioral Collaborative Filtering using co-occurrence and artist preference.
- Deployed FastAPI service with a Redis caching layer for recommendation serving.
- Implemented background daemons via Systemd to manage the polling engine and API services.
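The co-occurrence signal behind that collaborative-filtering pivot reduces to counting track pairs that share a listener. A toy pure-Python stand-in (the real model is built as dbt SQL in the Gold layer):

```python
from collections import Counter
from itertools import combinations

def track_co_occurrence(plays_by_user: dict[str, list[str]]) -> Counter:
    """Count unordered track pairs appearing in the same user's history.
    Illustrative only; names and shapes are not from the repo."""
    pairs = Counter()
    for tracks in plays_by_user.values():
        # sorted(set(...)) dedupes repeats and keeps pair keys canonical
        for a, b in combinations(sorted(set(tracks)), 2):
            pairs[(a, b)] += 1
    return pairs

plays = {
    "u1": ["t1", "t2", "t3"],
    "u2": ["t1", "t2"],
}
print(track_co_occurrence(plays)[("t1", "t2")])  # 2 — both users played t1 and t2
```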
- AWS Account (EC2, S3, IAM).
- Snowflake Account (Warehouse, Database, Schemas).
- Spotify Developer Account (API Client ID/Secret).
- GitHub Account for version control.
- Gmail Account (For Airflow SMTP alerts).
- Python: 3.13.2 or higher
- Docker & Docker Compose: 2.24.0 or higher
- Redis Server: 7.0+ (For API caching).
- AWS CLI: 2.32.12 or higher
- Git: Latest version
- Instance Type: t3.small (2 vCPU, 2GB RAM)
- OS: Ubuntu 24.04 LTS
- Storage: 20GB SSD
- Region: us-east-2
- Additional: 2GB swap configured for stability
git clone https://github.com/sunilmakkar/spotify-data-pipeline.git
cd spotify-data-pipeline

Create a .env file in the project root:
# AWS Configuration
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
S3_BUCKET=spotify-data-lake-sunil-2025
AWS_REGION=us-east-2
# Kafka Configuration
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
KAFKA_API_KEY=your_kafka_key
KAFKA_API_SECRET=your_kafka_secret
# Snowflake Configuration
SNOWFLAKE_ACCOUNT=your_account_id
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_WAREHOUSE=SPOTIFY_WH
SNOWFLAKE_DATABASE=SPOTIFY_DATA
SNOWFLAKE_SCHEMA=RAW
# Spotify Configuration (for Phase 5)
SPOTIFY_CLIENT_ID=your_client_id
SPOTIFY_CLIENT_SECRET=your_client_secret
SPOTIFY_REDIRECT_URI=http://localhost:8888/callback
# Gmail SMTP (for Airflow alerts)
GMAIL_USER=your_email@gmail.com
GMAIL_APP_PASSWORD=your_app_password
# Airflow
AIRFLOW_UID=50000

# Configure AWS CLI
aws configure
# Create S3 bucket
aws s3 mb s3://spotify-data-lake-sunil-2025 --region us-east-2
# Verify bucket creation
aws s3 ls

Run the following SQL in Snowflake:
-- Create warehouse
CREATE WAREHOUSE IF NOT EXISTS SPOTIFY_WH
WAREHOUSE_SIZE = 'X-SMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
-- Create database and schemas
CREATE DATABASE IF NOT EXISTS SPOTIFY_DATA;
CREATE SCHEMA IF NOT EXISTS SPOTIFY_DATA.BRONZE;
CREATE SCHEMA IF NOT EXISTS SPOTIFY_DATA.SILVER;
CREATE SCHEMA IF NOT EXISTS SPOTIFY_DATA.GOLD;
-- Create external table (Bronze layer)
CREATE OR REPLACE EXTERNAL TABLE SPOTIFY_DATA.BRONZE.raw_plays
(event_id VARCHAR AS (value:event_id::VARCHAR),
user_id VARCHAR AS (value:user_id::VARCHAR),
track_id VARCHAR AS (value:track_id::VARCHAR),
played_at TIMESTAMP AS (value:played_at::TIMESTAMP),
duration_ms INT AS (value:duration_ms::INT),
device_type VARCHAR AS (value:device_type::VARCHAR))
LOCATION = @SPOTIFY_DATA.BRONZE.spotify_stage
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = FALSE;

# SSH into EC2 instance
ssh -i your-key.pem ubuntu@your-ec2-ip
# Install Kafka
wget https://downloads.apache.org/kafka/3.5.1/kafka_2.13-3.5.1.tgz
tar -xzf kafka_2.13-3.5.1.tgz
cd kafka_2.13-3.5.1
# Start Zookeeper
bin/zookeeper-server-start.sh config/zookeeper.properties &
# Start Kafka
bin/kafka-server-start.sh config/server.properties &
# Create topics
bin/kafka-topics.sh --create --topic spotify-plays --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
bin/kafka-topics.sh --create --topic spotify-events --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
bin/kafka-topics.sh --create --topic spotify-likes --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
bin/kafka-topics.sh --create --topic spotify-skips --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

# Clone repo on EC2
git clone https://github.com/sunilmakkar/spotify-data-pipeline.git
cd spotify-data-pipeline
# Create swap space (for t3.small memory optimization)
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make swap permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
# Start Docker containers
docker-compose up -d
# Verify containers are running
docker-compose ps

# Install DBT (if running locally)
pip install dbt-core==1.10.15 dbt-snowflake==1.8.3
# Configure DBT profile (already mounted in Docker)
# Profile location: dbt/profiles.yml
# Test DBT connection
docker-compose exec airflow-scheduler bash -c "cd /opt/airflow/dbt && dbt debug"
# Run DBT models
docker-compose exec airflow-scheduler bash -c "cd /opt/airflow/dbt && dbt run"

To enable near real-time playback detection, the platform uses a background daemon managed by Systemd. This service continuously polls the Spotify API and logs state changes to Snowflake.
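The heart of that daemon is a loop that emits only state *changes*, not every poll. A minimal sketch (not the repo's spotify_poller.py; the fetch and sink functions are injected here so the loop stays testable, whereas the real service calls Spotify's currently-playing endpoint and writes to Snowflake):

```python
import time

def poll_playback(fetch_now_playing, sink, interval: float = 5.0, max_polls=None):
    """Poll a now-playing source and forward only state changes to the sink.
    Illustrative sketch; fetch_now_playing returns a dict (or None when
    nothing is playing), and sink receives each new state exactly once."""
    last_track = None
    polls = 0
    while max_polls is None or polls < max_polls:
        current = fetch_now_playing()
        track = current.get("track_id") if current else None
        if track != last_track:                  # track changed, started, or stopped
            sink(current)
            last_track = track
        polls += 1
        time.sleep(interval)

# Simulate four polls: same track twice, a new track, then playback stops.
states = iter([{"track_id": "t1"}, {"track_id": "t1"}, {"track_id": "t2"}, None])
changes = []
poll_playback(lambda: next(states), changes.append, interval=0, max_polls=4)
print(changes)  # [{'track_id': 't1'}, {'track_id': 't2'}, None]
```

Deduplicating on track state is what keeps a 5-second poll from flooding Snowflake with identical rows.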
- Configure the Service File: The service configuration is located in systemd/spotify-poller.service. You must update the file paths within this file to match your EC2 environment:
[Unit]
Description=Spotify Currently Playing Poller
After=network.target
[Service]
# Update these paths to match your project location
User=ubuntu
WorkingDirectory=/home/ubuntu/spotify-data-pipeline
ExecStart=/home/ubuntu/spotify-data-pipeline/venv/bin/python src/spotify_poller.py
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
- Deploy and Enable the Background Service
- Copy the service file to the system directory, then enable it to ensure it starts automatically on boot:
# Copy the service file to systemd
sudo cp systemd/spotify-poller.service /etc/systemd/system/
# Reload systemd to recognize the new service
sudo systemctl daemon-reload
# Enable the service for auto-start on boot
sudo systemctl enable spotify-poller
# Start the service
sudo systemctl start spotify-poller
- Monitor the Service Logs
- Since the poller runs as a background process, you can verify its activity by following the system logs:
# View real-time logs for the poller
sudo journalctl -u spotify-poller -f
Similar to the poller, the FastAPI recommendation engine is managed as a background service to ensure sub-second response times and high availability.
- Deploy the API Service
- Follow the same process as in step 8 to enable the FastAPI serving layer:
# Copy the service file
sudo cp systemd/recommendation-api.service /etc/systemd/system/
# Reload and enable
sudo systemctl daemon-reload
sudo systemctl enable recommendation-api
# Start the service
sudo systemctl start recommendation-api
- Verify Service Status
sudo systemctl status recommendation-api
- Access the Airflow UI at http://your-ec2-ip:8082.
- Trigger the spotify_pipeline_basic DAG to run the ingestion and dbt models.
- In the Airflow UI, find the DAG: spotify_data_basic
- Click the play button (▶) to trigger it
- Monitor progress in Graph view (~6 minutes runtime)
- start - Dummy task marking pipeline start
- generate_events - EventSimulator creates 200 listening events (⚠️ Phase 5: Spotify API call)
- start_consumer - Launches Kafka consumer process
- wait_for_s3_files - Polls S3 until Parquet files appear
- stop_consumer - Terminates consumer process
- refresh_snowflake_table - Refreshes external table to detect new S3 files
- dbt_compile - Compiles DBT project
- dbt_run_silver - Runs Silver layer transformations
- dbt_run_gold - Runs Gold layer analytics models
- dbt_test - Executes data quality tests
- DAG ID: data_quality_monitoring
- Schedule: Every 30 minutes
- Checks: Row counts, null values, duplicate detection, value ranges, freshness
- Alerts: Email notifications on failures
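The monitoring checks listed above run as SQL against Snowflake; their logic can be sketched in pure Python to show the shape of each rule (function name, field names, and thresholds are illustrative):

```python
def run_quality_checks(rows: list[dict]) -> list[str]:
    """Toy versions of the monitoring DAG's rules: row count, nulls,
    duplicates, and value ranges. Returns a list of failure messages."""
    failures = []
    if not rows:
        failures.append("row_count: table is empty")
    ids = [r.get("event_id") for r in rows]
    if any(i is None for i in ids):
        failures.append("not_null: event_id contains NULLs")
    if len(ids) != len(set(ids)):
        failures.append("unique: duplicate event_id values")
    if any(not 0 <= r.get("duration_ms", 0) <= 3_600_000 for r in rows):
        failures.append("range: duration_ms outside 0-1h")
    return failures

print(run_quality_checks([{"event_id": "e1", "duration_ms": 200_000},
                          {"event_id": "e1", "duration_ms": 200_000}]))
# ['unique: duplicate event_id values']
```

In the real DAG, a non-empty failure list is what triggers the email alert.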
- Use journalctl -u spotify-poller -f to monitor real-time playback detection.
- Use journalctl -u recommendation-api -f to view API request logs.
The platform includes a production-ready API for consuming processed data. For detailed endpoint specifications, performance benchmarks, and service management commands, see the API Documentation.
Key API Features:
- GET /recommendations: Serves personalized tracks based on current playback.
- Swagger UI: Interactive testing available at /docs.
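The sub-100ms cache hits come from a cache-aside pattern: check Redis first, fall back to the warehouse, and store the result with a TTL. A minimal sketch with a plain dict standing in for Redis (the real service uses a Redis client; all names here are illustrative):

```python
import time

class CacheAside:
    """Cache-aside sketch: serve from the store on a fresh hit,
    otherwise compute (i.e. query the warehouse) and cache with a TTL."""
    def __init__(self, compute, ttl_s: float = 300.0, clock=time.monotonic):
        self.compute, self.ttl_s, self.clock = compute, ttl_s, clock
        self._store = {}   # stand-in for Redis: key -> (stored_at, value)

    def get(self, key):
        hit = self._store.get(key)
        if hit and self.clock() - hit[0] < self.ttl_s:
            return hit[1]                      # cache hit: in-memory latency
        value = self.compute(key)              # cache miss: expensive lookup
        self._store[key] = (self.clock(), value)
        return value

calls = []
cache = CacheAside(lambda user: calls.append(user) or [f"rec-for-{user}"])
print(cache.get("u1"), cache.get("u1"), len(calls))
# second lookup is served from the cache, so compute runs once
```

The TTL bounds staleness: recommendations can lag the Gold models by at most the TTL, which is the usual trade-off for dropping latency from a warehouse query to an in-memory read.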
Live URL: https://spotify-data-pipeline-2025.streamlit.app
- Overview
  - Total plays, unique tracks, unique artists
  - Total listening time
  - Data date range
- Trends
  - Daily plays over time (line chart)
  - Daily listening hours over time (line chart)
- Top Tracks
  - Bar chart of top 5 most played tracks
  - Hover details: artist name, play count
- Top Artists
  - Bar chart of top 5 most played artists
  - Shows play count and unique track count
- Device Usage
  - Pie chart: listening distribution by device type
  - Bar chart: play counts per device
cd streamlit_dashboard
# Install dependencies
pip install -r requirements.txt
# Create secrets file
mkdir .streamlit
cat > .streamlit/secrets.toml << EOF
[snowflake]
user = "your_username"
password = "your_password"
account = "your_account_id"
warehouse = "SPOTIFY_WH"
database = "SPOTIFY_DATA"
schema = "GOLD"
EOF
# Run dashboard
streamlit run app.py

The dashboard will open at http://localhost:8501
SPOTIFY-DATA-PIPELINE/
├── api/ # FastAPI recommendation service source
├── dags/ # Airflow workflow definitions
├── dbt/ # dbt models (Bronze, Silver, Gold)
├── docs/ # Technical specs & service guides
│ └── api/README.md # Dedicated API documentation
├── src/ # Core logic (Poller, Clients, Kafka)
├── systemd/ # Linux service configuration files
└── streamlit_dashboard/ # Streamlit UI source code
- Medallion Architecture: Separating raw, cleaned, and analytical layers improves maintainability and query performance
- Memory Optimization: Configured Airflow for t3.small (2GB RAM) by limiting parallelism and adding swap space
- Error Handling: Implemented retries with exponential backoff to handle transient failures gracefully
- Data Quality: DBT tests catch issues early before bad data propagates to Gold layer
- Incremental Processing: Date tracker ensures sequential data generation without duplicates
- Monitoring: Separate monitoring DAG provides continuous data quality visibility
- Custom Polling vs. Webhooks: Since the Spotify Web API lacks webhook support for playback events, I designed a custom polling service that achieves near real-time detection by monitoring state changes every 5 seconds.
- Daemon Management: Implementing the poller and API as Systemd background services ensured high availability, allowing the real-time engine to persist across sessions and automatically recover from system restarts.
- Hybrid Data Velocity: Managing both hourly batch DAGs and a 5-second polling interval within the same Snowflake environment demonstrated how to unify different data velocities for a single analytical view.
- Sub-second Serving: Leveraging Redis caching for the recommendation API highlighted the importance of in-memory storage to reduce latency from ~700ms (database query) to <100ms (cache hit) for end-user applications.
Status: 🔴 Decommissioned (January 2026)
This project was successfully built and deployed from November 2025 to January 2026. The infrastructure has been shut down to avoid ongoing cloud costs. All documentation, code, and evidence of the working system are preserved in this repository.
This project requires:
- Confluent Cloud account or local Kafka cluster
- Snowflake account with appropriate credentials
- Docker environment
- [Setup instructions preserved for reference]
- Spotify Web API for future integration possibilities
- Snowflake for powerful data warehousing capabilities
- Apache Airflow community for excellent orchestration tools
- DBT for modern data transformation framework
- Streamlit for rapid dashboard development
Sunil Makkar
- GitHub: @sunilmakkar
- Project Link: https://github.com/sunilmakkar/spotify-data-pipeline
Built with ❤️ as a portfolio project to demonstrate modern hybrid data engineering skills.




