- UV - Fast Python package manager (manages Python versions automatically)
1. Install UV (if not already installed):

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

2. Clone the repository:

   ```bash
   git clone <repository-url>
   cd igh-data-transform
   ```

3. Install Python and dependencies:

   ```bash
   # UV will automatically install Python 3.12 if needed
   uv sync
   ```
You can run the CLI tool without activating the virtual environment using `uv run`:

```bash
# Show available commands
uv run igh-transform --help
```

Alternatively, activate the virtual environment first:

```bash
source .venv/bin/activate   # On Linux/Mac
# or
.venv\Scripts\activate      # On Windows

# Then run normally
igh-transform --help
```

Transform raw Bronze layer data to a cleaned Silver layer:
```bash
uv run igh-transform bronze-to-silver --bronze-db ./data/bronze.db --silver-db ./data/silver.db
```

This applies cleanup transformations:

- Drops columns that are entirely null (preserves `valid_from`/`valid_to`)
- Normalizes whitespace in text fields
- Ready for table-specific column renames and value mappings
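The cleanup steps above can be sketched roughly as follows. This is a minimal illustration using the standard library, not igh-transform's actual implementation; the `clean_table` helper and the example table are hypothetical.

```python
import re
import sqlite3

def _squash_ws(value):
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return re.sub(r"\s+", " ", value).strip() if isinstance(value, str) else value

def clean_table(conn, table):
    """Drop all-null columns (keeping valid_from/valid_to) and normalize text."""
    conn.create_function("squash_ws", 1, _squash_ws)
    for _, name, col_type, *_ in conn.execute(f"PRAGMA table_info({table})").fetchall():
        if name in ("valid_from", "valid_to"):
            continue  # SCD2 validity columns are always preserved
        # COUNT(col) counts only non-null values, so 0 means the column is empty
        if conn.execute(f"SELECT COUNT({name}) FROM {table}").fetchone()[0] == 0:
            # Requires SQLite >= 3.35 for ALTER TABLE ... DROP COLUMN
            conn.execute(f"ALTER TABLE {table} DROP COLUMN {name}")
        elif "TEXT" in (col_type or "").upper():
            conn.execute(f"UPDATE {table} SET {name} = squash_ws({name})")

# Demo on a hypothetical table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (name TEXT, unused TEXT, valid_from TEXT, valid_to TEXT)")
conn.execute("INSERT INTO contact VALUES ('  Ada   Lovelace ', NULL, '2024-01-01', NULL)")
clean_table(conn, "contact")
```

After the call, `unused` is gone, `valid_to` survives despite being all-null, and `name` reads `Ada Lovelace`.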
Transform the Silver layer to a star-schema Gold layer (dimensions, facts, bridges):

```bash
uv run igh-transform silver-to-gold --silver-db ./data/silver.db --gold-db ./data/star_schema.db
```

Two wrapper scripts run the full pipeline and copy the resulting star schema database to the backend.
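To illustrate the dimension/fact/bridge shape mentioned above, here is a generic star-schema sketch. The table and column names (`dim_site`, `fact_measurement`, `bridge_site_group`) are hypothetical placeholders, not the actual tables produced in `star_schema.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: descriptive attributes keyed by a surrogate key
    CREATE TABLE dim_site (
        site_key  INTEGER PRIMARY KEY,
        site_name TEXT
    );
    -- Fact: measurements pointing at dimensions via foreign keys
    CREATE TABLE fact_measurement (
        measurement_key INTEGER PRIMARY KEY,
        site_key        INTEGER REFERENCES dim_site(site_key),
        value           REAL
    );
    -- Bridge: resolves many-to-many links between facts and dimensions
    CREATE TABLE bridge_site_group (
        site_key  INTEGER REFERENCES dim_site(site_key),
        group_key INTEGER
    );
""")
conn.execute("INSERT INTO dim_site VALUES (1, 'Site A')")
conn.execute("INSERT INTO fact_measurement VALUES (1, 1, 42.0)")

# Typical star-schema query: join the fact to its dimension
row = conn.execute("""
    SELECT d.site_name, f.value
    FROM fact_measurement f JOIN dim_site d USING (site_key)
""").fetchone()
```

Queries against the Gold layer follow this pattern: filter and group by dimension attributes, aggregate over fact columns.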
`sync-and-run-etl.sh` syncs data from Dataverse, then runs Bronze -> Silver -> Gold -> Backend:

```bash
# Fresh sync (default) - deletes the existing bronze DB and syncs from scratch
./sync-and-run-etl.sh

# Incremental sync - keeps the existing bronze DB
./sync-and-run-etl.sh --update

# Skip sync entirely - use an existing bronze DB
./sync-and-run-etl.sh --skip-sync

# Use a custom .env file for Dataverse credentials
./sync-and-run-etl.sh --env-file /path/to/.env
```

`run-etl.sh` runs the transformation pipeline on an existing bronze DB (no Dataverse sync):
```bash
# Use the default bronze DB path (data/dataverse_complete_raw.db)
./run-etl.sh

# Use a custom bronze DB path
./run-etl.sh /path/to/bronze.db
```

Both scripts produce `star_schema.db` and copy it to `../backend/` and `../backend/tests/`.
This project uses igh-data-sync to pull data from Microsoft Dataverse before applying transformations.
Setup:

1. Configure environment variables - Create a `.env` file with your Dataverse credentials:

   ```bash
   CLIENT_ID=your-azure-client-id
   CLIENT_SECRET=your-azure-client-secret
   SCOPE=https://your-org.crm.dynamics.com/.default
   API_URL=https://your-org.api.crm.dynamics.com/api/data/v9.2/
   SQLITE_DB_PATH=./data/dataverse.db
   ```

2. Run the sync - Pull data from Dataverse into local SQLite:

   ```bash
   uv run sync-dataverse
   ```

3. Verify the data (optional) - Check foreign key integrity:

   ```bash
   uv run sync-dataverse --verify
   ```
The synced data will be stored in a SQLite database with SCD2 (Slowly Changing Dimension Type 2) versioning for historical tracking.
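Under SCD2 versioning, every change to a record produces a new row with its own validity window rather than overwriting the old one. The sketch below shows the standard query patterns over such a table; the `contact` table and its columns are hypothetical, assuming the `valid_from`/`valid_to` convention mentioned earlier (where a NULL `valid_to` marks the current version).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contact (
        id TEXT, email TEXT,
        valid_from TEXT, valid_to TEXT   -- NULL valid_to = current version
    )
""")
# Two historical versions of the same logical record
conn.executemany("INSERT INTO contact VALUES (?, ?, ?, ?)", [
    ("c1", "old@example.com", "2024-01-01", "2024-06-01"),
    ("c1", "new@example.com", "2024-06-01", None),
])

# Current state: rows whose validity window is still open
current = conn.execute(
    "SELECT email FROM contact WHERE valid_to IS NULL"
).fetchall()

# Point-in-time state: the window that contained a given date
as_of = conn.execute(
    "SELECT email FROM contact "
    "WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)",
    ("2024-03-15", "2024-03-15"),
).fetchall()
```

Here `current` contains only the new email, while `as_of` recovers the old one, since the March date falls inside the first version's window.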
The project uses UV for dependency management. Common commands:

- Add a dependency: `uv add <package-name>`
- Add a dev dependency: `uv add --dev <package-name>`
- Update dependencies: `uv sync`
- Run commands without activating the venv: `uv run <command>`
- Run unit tests: `uv run pytest`
- Run e2e tests: `E2E_BRONZE_DB_PATH=/path/to/bronze.db uv run pytest --e2e -v`
- Run all tests: `E2E_BRONZE_DB_PATH=/path/to/bronze.db uv run pytest --all -v`
- Run tests with coverage: `uv run pytest --cov=igh_data_transform --cov-report=term-missing`
- Run linter: `uv run ruff check src/ tests/`
- Adding Transformations - Guide for data analysts on how to add new data transformations