End-to-end data engineering & analytics project that extracts 7+ years of Formula 1 race data, models it in a dimensional data warehouse, and delivers interactive visualizations through a custom-built dashboard.
This project demonstrates a complete data engineering pipeline β from raw API extraction to a polished, interactive analytics dashboard. It processes historical Formula 1 data spanning the 2018β2025 seasons, covering race results, lap-by-lap telemetry, pit stop strategies, and qualifying sessions.
Key highlights:
- ETL Pipeline built in Python that extracts data from the FastF1 API, applies business transformations (team name normalization, time conversions, null handling), and loads it into a PostgreSQL star schema
- Dimensional Data Model with 4 dimension tables and 4 fact tables, optimized with indexes for analytical queries across 100k+ lap records
- Interactive Dashboard with 5 analytical views, dynamic team-based theming, and light/dark mode support
- Incremental Loading β the ETL detects already-loaded races and skips them, with graceful API rate limit handling
- 41 automated tests covering transformations, ORM models, and ETL logic
βββββββββββββββββββ
β FastF1 API β Historical F1 data source
ββββββββββ¬βββββββββ
β Extract (Python Β· requests Β· caching)
ββββββββββΌβββββββββ
β ETL Engine β Pandas Β· SQLAlchemy Β· Custom transforms
β (Transform) β Team normalization Β· Time conversion
ββββββββββ¬βββββββββ Null handling Β· Incremental logic
β Load (bulk insert Β· chunksize=100)
ββββββββββΌβββββββββ
β PostgreSQL β Docker container Β· Star Schema
β 15 (DW) β 4 dimensions Β· 4 fact tables
ββββββββββ¬βββββββββ Indexed FKs Β· Persistent volume
β Query (SQLAlchemy Β· cached)
ββββββββββΌβββββββββ
β Streamlit β 5 analytical pages
β Dashboard β Plotly charts Β· Dynamic theming
βββββββββββββββββββ Light/Dark mode Β· Team palettes
The data warehouse follows a Kimball-style star schema designed for analytical queries:
| Table | Description | Key Fields |
|---|---|---|
dim_drivers |
Driver master data | full_name, abbreviation, nationality, number |
dim_teams |
Team/constructor data with canonical names | team_name, team_color |
dim_circuits |
Circuit information | circuit_name, country, locality |
dim_calendar |
Race calendar linking year, round, circuit, date, and format (sprint/race) | year, round, circuit_id, event_date |
| Table | Grain | Description | Records |
|---|---|---|---|
fact_results |
1 row per driver/race | Final position, points, grid, status, fastest lap time, total race time | ~2,000+ |
fact_laps |
1 row per driver/lap | Sector times, compound, tyre life, position, pit in/out flags | ~100,000+ |
fact_pit_stops |
1 row per pit stop | Stop duration, compound before/after, lap number | ~5,000+ |
fact_qualifying |
1 row per driver/race | Q1/Q2/Q3 times, qualifying position | ~2,000+ |
All foreign keys are indexed for fast joins across the schema.
| Layer | Technology | Purpose |
|---|---|---|
| Data Extraction | FastF1 API | Official F1 timing data source |
| Transformation | Pandas, NumPy | Data cleaning, normalization, type conversion |
| Data Warehouse | PostgreSQL 15 (Docker) | Persistent star schema storage |
| ORM | SQLAlchemy 2.x | Schema definition, get-or-create patterns |
| Dashboard | Streamlit (multipage) | Interactive web application |
| Visualization | Plotly Express / Graph Objects | Charts, heatmaps, radar plots |
| Styling | Custom CSS injection | Dynamic theming, team palettes, light/dark |
| Testing | pytest | Unit tests with SQLite in-memory fixtures |
| Infrastructure | Docker Compose | Containerized database with health checks |
- Uses the FastF1 library to access the official F1 timing API
- Data is cached locally (
./cache/) to avoid redundant API calls - Rate limit handling: the pipeline detects the 500 calls/hour API limit and gracefully skips affected races, allowing re-execution later to fill gaps
- Team name normalization: historical team names are mapped to their current canonical name (e.g., Renault β Alpine, Toro Rosso β AlphaTauri β RB)
- Time conversion: all
timedeltaobjects are converted tofloatseconds for consistent storage and calculation - Null sanitization: custom
_clean_strhelper convertsNaN,NaT,"nan", and empty strings toNonebefore database insertion - Position/points cleaning: handles edge cases like
0, negative values, and non-numeric strings
- Incremental: checks if a race is already loaded before processing β safe to re-run at any time
- Bulk inserts: uses Pandas
to_sqlwithmethod='multi'andchunksize=100to avoid SQL parameter overflow - Transaction management: dimension records are committed before fact table inserts to satisfy foreign key constraints across separate connections
- Season KPIs: races, drivers, teams, championship leader
- Driver and constructor standings with interactive line charts
- Points evolution across the season with team-colored traces
- Race wins and podiums distribution (horizontal bar charts)
- Deep dive into any individual race (selectable by season + round)
- Full results table, position chart (lap-by-lap), tyre strategy visualization
- Lap time distribution (box plots), qualifying results with gap-to-pole analysis
- Side-by-side KPIs for any two drivers in a season
- Radar chart comparing points, wins, podiums, consistency, and reliability
- Race-by-race comparison table with head-to-head win count
- Dynamic team theming: the entire UI adapts to the selected team's official color palette
- Cumulative points progression and per-race points breakdown
- Teammate battle: points and finish position comparison
- Reliability analysis (pie chart) and grid-vs-finish positions gained/lost
- Lap time evolution per driver (excluding pit laps for clean visualization)
- Tyre degradation scatter plot: lap time vs tyre life by compound
- Best sector times table with theoretical best lap calculation
- Sector heatmap (driver Γ lap) with F1-style purple/yellow/green color scale
- Gap to leader chart across the race distance
The dashboard features a dual-mode theme system (light/dark) with a toggle in the sidebar.
- Background:
#0F1117β deep black inspired by F1 broadcast graphics - Cards: gradient from
#1A1A2Eto#16213E - Accent:
#E10600(official F1 red)
- Background:
#F5F5F7β clean, professional white - Cards: white with subtle gradient
- All text, charts, and UI elements adapt automatically
When a team is selected in the sidebar, all accent colors (tabs, headers, metric borders, chart highlights) switch to the team's official 2024 colors:
| Team | Primary | Secondary | Accent |
|---|---|---|---|
| Red Bull Racing | #3671C6 |
#1B1F27 |
#FFD700 |
| Ferrari | #E80020 |
#FFEB3B |
#FFFFFF |
| Mercedes | #27F4D2 |
#000000 |
#AAAAAA |
| McLaren | #FF8000 |
#47C7FC |
#FFFFFF |
| Aston Martin | #229971 |
#CEDC00 |
#FFFFFF |
| Alpine | #FF87BC |
#0093CC |
#FFFFFF |
| Williams | #64C4FF |
#041E42 |
#FFFFFF |
| RB | #6692FF |
#1B3D73 |
#FFFFFF |
| Kick Sauber | #52E252 |
#000000 |
#FFFFFF |
| Haas | #B6BABD |
#B6181C |
#FFFFFF |
Tyre compounds use official F1 colors: π΄ Soft Β· π‘ Medium Β· βͺ Hard Β· π’ Intermediate Β· π΅ Wet
- Docker & Docker Compose
- Python 3.10+
git clone <repo-url>
cd f1_project
# Optionally edit .env for custom DB credentialsdocker compose up -d
docker compose ps # verify it's healthypip install -r requirements.txt# Full load (2018β2025)
python -m src.etl_engine
# Custom range
python -m src.etl_engine 2023 2025
# Single season
python -m src.etl_engine 2024The ETL is incremental β safe to re-run. Already-loaded races are skipped automatically. First run downloads data from the API and caches it locally. Subsequent runs are much faster. The FastF1 API has a 500 calls/hour rate limit. If hit, the ETL skips affected races and can be re-run later.
streamlit run src/dashboard/app.pyOpens at http://localhost:8501
pytest tests/ -v41 tests covering:
| Test File | Coverage |
|---|---|
tests/test_transformations.py |
Team name normalization, timedelta conversion, position/points cleaning |
tests/test_models.py |
ORM table creation in SQLite in-memory, unique constraints, relationships |
tests/test_etl.py |
Get-or-create dimension patterns, upsert behavior, abbreviation updates |
f1_project/
βββ docker-compose.yml # PostgreSQL 15 Β· persistent volume Β· healthcheck
βββ .env # DB credentials (gitignored)
βββ .gitignore
βββ .streamlit/config.toml # Streamlit theme config
βββ requirements.txt # Python dependencies
βββ README.md
β
βββ src/
β βββ config.py # DB connection, cache path, ETL parameters
β βββ models.py # SQLAlchemy ORM (star schema)
β βββ etl_engine.py # ETL pipeline: extract β transform β load
β βββ transformations.py # Normalization, cleaning, conversion
β βββ logger.py # Dual logging (console + file)
β βββ dashboard/
β βββ app.py # Streamlit entry point (multipage)
β βββ data.py # Data access layer (cached queries)
β βββ theme.py # Palettes, CSS injection, light/dark
β βββ components.py # KPI cards, chart builders, tables
β βββ pages/
β βββ 01_overview.py # Championship Overview
β βββ 02_race_analysis.py # Race Analysis
β βββ 03_driver_compare.py # Driver Comparison (H2H)
β βββ 04_team_performance.py# Team Performance
β βββ 05_lap_analysis.py # Lap & Strategy Analysis
β
βββ tests/
β βββ conftest.py # pytest fixtures (SQLite in-memory)
β βββ test_transformations.py
β βββ test_models.py
β βββ test_etl.py
β
βββ cache/ # FastF1 API cache (gitignored)
| Challenge | Solution |
|---|---|
| FastF1 API rate limit (500 calls/hour) | Implemented graceful skip-on-limit with incremental loading β re-run fills gaps |
psycopg2 SQL parameter overflow on bulk inserts |
Reduced chunksize from 500 to 100 in pandas.to_sql |
| Foreign key violations during fact table loading | Changed session.flush() to session.commit() before to_sql calls (separate connections) |
NaN strings persisted in compound fields |
Built _clean_str helper to sanitize NaN, NaT, "nan", empty strings β None |
| Historical team name inconsistencies | Created normalization mapping (30+ aliases β 10 canonical names) |
| PostgreSQL timezone configuration errors | Created timezone symlink in Docker entrypoint for client compatibility |
| Streamlit subprocess import resolution | Added sys.path injection at the top of each page script |
- Open Power BI Desktop β Get Data β PostgreSQL
- Server:
localhost:5432Β· Database:f1_warehouseΒ· User:f1admin - Select the star schema tables
- Verify relationships on
driver_id,team_id,race_id,circuit_id
- Team normalization: 30+ historical names mapped to current canonical names (Renault β Alpine, Toro Rosso β RB, etc.)
- Time storage: all
timedeltavalues converted tofloatseconds for consistent arithmetic - Bulk loading:
method='multi'withchunksize=100prevents SQL parameter overflow - Index strategy: all foreign keys indexed for fast analytical joins across 100k+ rows
- Query caching: Streamlit
@st.cache_data(ttl=600)prevents redundant DB queries - Persistent storage: PostgreSQL data stored in Docker volume
f1_pgdata - Typography: Inter font family (Google Fonts) with weights 300β900 for professional appearance