Skip to content

🏎 F1 Analytics Hub β€” End-to-end data engineering platform that extracts 7+ years of Formula 1 data from the FastF1 API, models it in a PostgreSQL star schema (Kimball), and delivers interactive analytics through a Streamlit dashboard with 5 views, dynamic team theming, and light/dark mode.

Notifications You must be signed in to change notification settings

krushodev/f1-analytics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1 Commit
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🏎 F1 Analytics Hub β€” Historical Race Intelligence Platform

End-to-end data engineering & analytics project that extracts 7+ years of Formula 1 race data, models it in a dimensional data warehouse, and delivers interactive visualizations through a custom-built dashboard.

Python PostgreSQL Streamlit Docker


Project Overview

This project demonstrates a complete data engineering pipeline β€” from raw API extraction to a polished, interactive analytics dashboard. It processes historical Formula 1 data spanning the 2018–2025 seasons, covering race results, lap-by-lap telemetry, pit stop strategies, and qualifying sessions.

Key highlights:

  • ETL Pipeline built in Python that extracts data from the FastF1 API, applies business transformations (team name normalization, time conversions, null handling), and loads it into a PostgreSQL star schema
  • Dimensional Data Model with 4 dimension tables and 4 fact tables, optimized with indexes for analytical queries across 100k+ lap records
  • Interactive Dashboard with 5 analytical views, dynamic team-based theming, and light/dark mode support
  • Incremental Loading β€” the ETL detects already-loaded races and skips them, with graceful API rate limit handling
  • 41 automated tests covering transformations, ORM models, and ETL logic

Architecture

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   FastF1 API    β”‚  Historical F1 data source
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚ Extract (Python Β· requests Β· caching)
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   ETL Engine    β”‚  Pandas Β· SQLAlchemy Β· Custom transforms
                    β”‚   (Transform)   β”‚  Team normalization Β· Time conversion
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜  Null handling Β· Incremental logic
                             β”‚ Load (bulk insert Β· chunksize=100)
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   PostgreSQL    β”‚  Docker container Β· Star Schema
                    β”‚   15 (DW)      β”‚  4 dimensions Β· 4 fact tables
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜  Indexed FKs Β· Persistent volume
                             β”‚ Query (SQLAlchemy Β· cached)
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   Streamlit     β”‚  5 analytical pages
                    β”‚   Dashboard     β”‚  Plotly charts Β· Dynamic theming
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  Light/Dark mode Β· Team palettes

Data Model β€” Star Schema

The data warehouse follows a Kimball-style star schema designed for analytical queries:

Dimension Tables

Table Description Key Fields
dim_drivers Driver master data full_name, abbreviation, nationality, number
dim_teams Team/constructor data with canonical names team_name, team_color
dim_circuits Circuit information circuit_name, country, locality
dim_calendar Race calendar linking year, round, circuit, date, and format (sprint/race) year, round, circuit_id, event_date

Fact Tables

Table Grain Description Records
fact_results 1 row per driver/race Final position, points, grid, status, fastest lap time, total race time ~2,000+
fact_laps 1 row per driver/lap Sector times, compound, tyre life, position, pit in/out flags ~100,000+
fact_pit_stops 1 row per pit stop Stop duration, compound before/after, lap number ~5,000+
fact_qualifying 1 row per driver/race Q1/Q2/Q3 times, qualifying position ~2,000+

All foreign keys are indexed for fast joins across the schema.


Technology Stack

Layer Technology Purpose
Data Extraction FastF1 API Official F1 timing data source
Transformation Pandas, NumPy Data cleaning, normalization, type conversion
Data Warehouse PostgreSQL 15 (Docker) Persistent star schema storage
ORM SQLAlchemy 2.x Schema definition, get-or-create patterns
Dashboard Streamlit (multipage) Interactive web application
Visualization Plotly Express / Graph Objects Charts, heatmaps, radar plots
Styling Custom CSS injection Dynamic theming, team palettes, light/dark
Testing pytest Unit tests with SQLite in-memory fixtures
Infrastructure Docker Compose Containerized database with health checks

ETL Pipeline β€” Design Decisions

Extraction

  • Uses the FastF1 library to access the official F1 timing API
  • Data is cached locally (./cache/) to avoid redundant API calls
  • Rate limit handling: the pipeline detects the 500 calls/hour API limit and gracefully skips affected races, allowing re-execution later to fill gaps

Transformation

  • Team name normalization: historical team names are mapped to their current canonical name (e.g., Renault β†’ Alpine, Toro Rosso β†’ AlphaTauri β†’ RB)
  • Time conversion: all timedelta objects are converted to float seconds for consistent storage and calculation
  • Null sanitization: custom _clean_str helper converts NaN, NaT, "nan", and empty strings to None before database insertion
  • Position/points cleaning: handles edge cases like 0, negative values, and non-numeric strings

Loading

  • Incremental: checks if a race is already loaded before processing β€” safe to re-run at any time
  • Bulk inserts: uses Pandas to_sql with method='multi' and chunksize=100 to avoid SQL parameter overflow
  • Transaction management: dimension records are committed before fact table inserts to satisfy foreign key constraints across separate connections

Dashboard β€” 5 Analytical Views

πŸ† Championship Overview

  • Season KPIs: races, drivers, teams, championship leader
  • Driver and constructor standings with interactive line charts
  • Points evolution across the season with team-colored traces
  • Race wins and podiums distribution (horizontal bar charts)

🏁 Race Analysis

  • Deep dive into any individual race (selectable by season + round)
  • Full results table, position chart (lap-by-lap), tyre strategy visualization
  • Lap time distribution (box plots), qualifying results with gap-to-pole analysis

πŸ‘₯ Driver Comparison (Head-to-Head)

  • Side-by-side KPIs for any two drivers in a season
  • Radar chart comparing points, wins, podiums, consistency, and reliability
  • Race-by-race comparison table with head-to-head win count

🏒 Team Performance

  • Dynamic team theming: the entire UI adapts to the selected team's official color palette
  • Cumulative points progression and per-race points breakdown
  • Teammate battle: points and finish position comparison
  • Reliability analysis (pie chart) and grid-vs-finish positions gained/lost

⏱ Lap & Strategy Analysis

  • Lap time evolution per driver (excluding pit laps for clean visualization)
  • Tyre degradation scatter plot: lap time vs tyre life by compound
  • Best sector times table with theoretical best lap calculation
  • Sector heatmap (driver Γ— lap) with F1-style purple/yellow/green color scale
  • Gap to leader chart across the race distance

Theming System

The dashboard features a dual-mode theme system (light/dark) with a toggle in the sidebar.

Dark Mode (Default)

  • Background: #0F1117 β€” deep black inspired by F1 broadcast graphics
  • Cards: gradient from #1A1A2E to #16213E
  • Accent: #E10600 (official F1 red)

Light Mode

  • Background: #F5F5F7 β€” clean, professional white
  • Cards: white with subtle gradient
  • All text, charts, and UI elements adapt automatically

Team Color Palettes

When a team is selected in the sidebar, all accent colors (tabs, headers, metric borders, chart highlights) switch to the team's official 2024 colors:

Team Primary Secondary Accent
Red Bull Racing #3671C6 #1B1F27 #FFD700
Ferrari #E80020 #FFEB3B #FFFFFF
Mercedes #27F4D2 #000000 #AAAAAA
McLaren #FF8000 #47C7FC #FFFFFF
Aston Martin #229971 #CEDC00 #FFFFFF
Alpine #FF87BC #0093CC #FFFFFF
Williams #64C4FF #041E42 #FFFFFF
RB #6692FF #1B3D73 #FFFFFF
Kick Sauber #52E252 #000000 #FFFFFF
Haas #B6BABD #B6181C #FFFFFF

Tyre compounds use official F1 colors: πŸ”΄ Soft Β· 🟑 Medium Β· βšͺ Hard Β· 🟒 Intermediate Β· πŸ”΅ Wet


Quick Start

Prerequisites

  • Docker & Docker Compose
  • Python 3.10+

1. Clone & Configure

git clone <repo-url>
cd f1_project
# Optionally edit .env for custom DB credentials

2. Start PostgreSQL

docker compose up -d
docker compose ps   # verify it's healthy

3. Install Dependencies

pip install -r requirements.txt

4. Run the ETL

# Full load (2018–2025)
python -m src.etl_engine

# Custom range
python -m src.etl_engine 2023 2025

# Single season
python -m src.etl_engine 2024

The ETL is incremental β€” safe to re-run. Already-loaded races are skipped automatically. First run downloads data from the API and caches it locally. Subsequent runs are much faster. The FastF1 API has a 500 calls/hour rate limit. If hit, the ETL skips affected races and can be re-run later.

5. Launch the Dashboard

streamlit run src/dashboard/app.py

Opens at http://localhost:8501


Testing

pytest tests/ -v

41 tests covering:

Test File Coverage
tests/test_transformations.py Team name normalization, timedelta conversion, position/points cleaning
tests/test_models.py ORM table creation in SQLite in-memory, unique constraints, relationships
tests/test_etl.py Get-or-create dimension patterns, upsert behavior, abbreviation updates

Project Structure

f1_project/
β”œβ”€β”€ docker-compose.yml                # PostgreSQL 15 Β· persistent volume Β· healthcheck
β”œβ”€β”€ .env                              # DB credentials (gitignored)
β”œβ”€β”€ .gitignore
β”œβ”€β”€ .streamlit/config.toml            # Streamlit theme config
β”œβ”€β”€ requirements.txt                  # Python dependencies
β”œβ”€β”€ README.md
β”‚
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ config.py                     # DB connection, cache path, ETL parameters
β”‚   β”œβ”€β”€ models.py                     # SQLAlchemy ORM (star schema)
β”‚   β”œβ”€β”€ etl_engine.py                 # ETL pipeline: extract β†’ transform β†’ load
β”‚   β”œβ”€β”€ transformations.py            # Normalization, cleaning, conversion
β”‚   β”œβ”€β”€ logger.py                     # Dual logging (console + file)
β”‚   └── dashboard/
β”‚       β”œβ”€β”€ app.py                    # Streamlit entry point (multipage)
β”‚       β”œβ”€β”€ data.py                   # Data access layer (cached queries)
β”‚       β”œβ”€β”€ theme.py                  # Palettes, CSS injection, light/dark
β”‚       β”œβ”€β”€ components.py             # KPI cards, chart builders, tables
β”‚       └── pages/
β”‚           β”œβ”€β”€ 01_overview.py        # Championship Overview
β”‚           β”œβ”€β”€ 02_race_analysis.py   # Race Analysis
β”‚           β”œβ”€β”€ 03_driver_compare.py  # Driver Comparison (H2H)
β”‚           β”œβ”€β”€ 04_team_performance.py# Team Performance
β”‚           └── 05_lap_analysis.py    # Lap & Strategy Analysis
β”‚
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ conftest.py                   # pytest fixtures (SQLite in-memory)
β”‚   β”œβ”€β”€ test_transformations.py
β”‚   β”œβ”€β”€ test_models.py
β”‚   └── test_etl.py
β”‚
└── cache/                            # FastF1 API cache (gitignored)

Challenges & Solutions

Challenge Solution
FastF1 API rate limit (500 calls/hour) Implemented graceful skip-on-limit with incremental loading β€” re-run fills gaps
psycopg2 SQL parameter overflow on bulk inserts Reduced chunksize from 500 to 100 in pandas.to_sql
Foreign key violations during fact table loading Changed session.flush() to session.commit() before to_sql calls (separate connections)
NaN strings persisted in compound fields Built _clean_str helper to sanitize NaN, NaT, "nan", empty strings β†’ None
Historical team name inconsistencies Created normalization mapping (30+ aliases β†’ 10 canonical names)
PostgreSQL timezone configuration errors Created timezone symlink in Docker entrypoint for client compatibility
Streamlit subprocess import resolution Added sys.path injection at the top of each page script

Connect Power BI (Optional)

  1. Open Power BI Desktop β†’ Get Data β†’ PostgreSQL
  2. Server: localhost:5432 Β· Database: f1_warehouse Β· User: f1admin
  3. Select the star schema tables
  4. Verify relationships on driver_id, team_id, race_id, circuit_id

Technical Notes

  • Team normalization: 30+ historical names mapped to current canonical names (Renault β†’ Alpine, Toro Rosso β†’ RB, etc.)
  • Time storage: all timedelta values converted to float seconds for consistent arithmetic
  • Bulk loading: method='multi' with chunksize=100 prevents SQL parameter overflow
  • Index strategy: all foreign keys indexed for fast analytical joins across 100k+ rows
  • Query caching: Streamlit @st.cache_data(ttl=600) prevents redundant DB queries
  • Persistent storage: PostgreSQL data stored in Docker volume f1_pgdata
  • Typography: Inter font family (Google Fonts) with weights 300–900 for professional appearance

About

🏎 F1 Analytics Hub β€” End-to-end data engineering platform that extracts 7+ years of Formula 1 data from the FastF1 API, models it in a PostgreSQL star schema (Kimball), and delivers interactive analytics through a Streamlit dashboard with 5 views, dynamic team theming, and light/dark mode.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages