AndreaBozzo/LakehouseStarterKit


🧩 Open Lakehouse Starter

[Screenshot: screen_01]

Open source lakehouse environment for small teams and startups, designed to be:

  • lightweight,
  • extensible,
  • scalable (ready for S3, Spark, orchestrators, Databricks, etc.).

Structure

  • dlt/: Python ETL pipeline for data ingestion
  • dbt/: SQL models and transformations
  • superset/: dashboard & visualizations
  • minio: S3-compatible object storage (runs as a Docker service)

Prerequisites

  • Docker and Docker Compose
  • Python 3.9 or higher
  • pip (Python package manager)

Installation

1. Clone the repository

git clone <your-repo-url>
cd LakehouseStarterKit

2. Create and activate a virtual environment

python -m venv .venv
# On Windows:
.venv\Scripts\activate
# On Linux/Mac:
source .venv/bin/activate

3. Install Python dependencies

pip install -r requirements.txt

4. Start Docker services

docker-compose up -d

This starts the containers defined in docker-compose.yml, including Superset and MinIO.

Wait a few minutes for Superset to complete its initialization.
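For orientation, here is a hedged sketch of the shape such a compose file typically takes; the actual service names, images, and port mappings live in docker-compose.yml and may differ:

```yaml
# Illustrative sketch only; consult docker-compose.yml for the real definitions.
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
  superset:
    image: apache/superset
    ports:
      - "8088:8088"   # Superset UI
```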

Quick Start

Run the data pipeline

# Load data from API to DuckDB using dlt
python dlt/pipelines/example_api.py

# Transform data using dbt
cd dbt
dbt run
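The pipeline step above fetches records from a public API and loads them into DuckDB via dlt. To illustrate the extract-load idea without third-party dependencies, here is a stdlib-only sketch that uses sqlite3 as a stand-in for DuckDB and inline sample records instead of a live API call (all names here are illustrative, not taken from example_api.py):

```python
import json
import sqlite3

# Sample records standing in for an API response (dlt would fetch these live).
records = [
    {"id": 1, "name": "alpha", "value": 10},
    {"id": 2, "name": "beta", "value": 20},
]

def load(rows, db_path=":memory:"):
    """Extract-load step: persist raw JSON records into a relational table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_example (id INTEGER PRIMARY KEY, payload TEXT)"
    )
    # Idempotent upsert keyed on id, mirroring dlt's merge-style loading.
    conn.executemany(
        "INSERT OR REPLACE INTO raw_example (id, payload) VALUES (?, ?)",
        [(r["id"], json.dumps(r)) for r in rows],
    )
    conn.commit()
    return conn

conn = load(records)
print(conn.execute("SELECT COUNT(*) FROM raw_example").fetchone()[0])  # prints 2
```

In the real pipeline, dlt handles schema inference, incremental state, and the DuckDB destination for you; this sketch only shows the raw-landing pattern the dbt models then build on.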

Access the tools

Once the containers are healthy, open Superset and the MinIO console in your browser. The exact ports are set in docker-compose.yml; Superset defaults to 8088 and the MinIO console to 9001 unless remapped.

Project Structure

.
├── dlt/                    # Data ingestion with dlt
│   ├── pipelines/
│   │   └── example_api.py  # Example pipeline fetching public APIs data
│   └── dlt.config.toml     # dlt configuration
├── dbt/                    # Data transformation with dbt
│   ├── models/
│   │   └── staging/
│   │       ├── sources.yml        # Source definitions
│   │       ├── schema.yml         # Model documentation
│   │       └── example_model.sql  # Example transformation
│   ├── dbt_project.yml
│   └── profiles.yml
├── superset/               # Superset configuration
│   └── superset_config.py
├── docker-compose.yml      # Docker services definition
├── requirements.txt        # Python dependencies
└── .env                    # Environment variables
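The staging model referenced above (dbt/models/staging/example_model.sql) presumably selects from the dlt-loaded source tables. A hedged sketch of what such a model commonly looks like; the source, table, and column names here are assumptions, not the repository's actual definitions:

```sql
-- Illustrative dbt staging model; real names live in sources.yml and example_model.sql.
with source as (
    select * from {{ source('raw', 'example_api') }}
)

select
    id,
    name,
    value
from source
```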

Next Steps

  • Explore the data in DuckDB at openlakehouse_demo.duckdb
  • Create your own dlt pipelines in dlt/pipelines/
  • Add dbt models in dbt/models/
  • Connect Superset to DuckDB and create dashboards
  • Scale up by connecting to S3 (MinIO), adding Spark, or integrating orchestrators
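For the Superset-to-DuckDB step, Superset expects a SQLAlchemy connection URI. Assuming the duckdb-engine driver is available in the Superset environment, the URI takes this shape (the path is illustrative; point it at the actual database file):

```
duckdb:///path/to/openlakehouse_demo.duckdb
```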

License

MIT License. See the LICENSE file for details.

About

Lakehouse starter kit for small teams (1-5). Extract data with dlt, transform with dbt, visualize with Superset. S3-compatible storage with MinIO. Easy to scale.
