Build an end-to-end modern data engineering pipeline on Azure that ingests the AdventureWorks data, transforms it using a Medallion (Bronze → Silver → Gold) architecture, and serves analytics-ready datasets to Power BI.
This project demonstrates how to:
- Ingest raw data into ADLS Gen2 using Azure Data Factory (Bronze)
- Transform and standardize data using Azure Databricks + PySpark (Silver)
- Publish curated views using Azure Synapse Analytics (Serverless SQL) (Gold)
- Build dashboards/reports in Power BI using the Gold layer
- Source: GitHub (HTTP/HTTPS raw links)
- Ingest CSV files into the Azure Data Lake Storage Gen2 Bronze container (as-is)
- Pipeline is metadata-driven using a JSON config (Lookup → ForEach → Copy)
- Transform Bronze CSVs in Databricks using Apache Spark / PySpark
- Write standardized outputs to ADLS Silver as Parquet (Snappy)
- Use Synapse Serverless SQL to create curated views directly on Silver Parquet using `OPENROWSET(BULK ..., FORMAT = 'PARQUET')`
- Power BI connects to Gold views for reporting
- Azure Data Factory (Bronze ingestion)
- Azure Data Lake Storage Gen2 (Bronze/Silver/Gold storage)
- Azure Databricks (Spark compute + ETL)
- Apache Spark / PySpark (transformations)
- Azure Synapse Analytics (Serverless SQL) (Gold views)
- Power BI (dashboarding)
- GitHub (version control + documentation)
- Medallion architecture design (Bronze/Silver/Gold)
- Metadata-driven ingestion (Lookup + ForEach + parameterized datasets in ADF)
- Scalable ETL with PySpark on Databricks
- Parquet optimization (Snappy) for analytics
- Synapse Serverless querying with `OPENROWSET`
- Curated semantic datasets for BI consumption
- Power BI modeling and KPI reporting
AdventureWorks CSV files used:
- Calendar
- Customers
- Product Categories
- Product Subcategories
- Products
- Returns
- Territories
- Sales (2015 / 2016 / 2017)
Created required Azure resources:
- Resource Group
- ADLS Gen2 Storage Account + containers: `bronze/`, `silver/`, `gold/`
- Azure Data Factory
- Azure Databricks
- Azure Synapse Analytics workspace
The ingestion layer is fully metadata-driven. Instead of creating one pipeline per file, Azure Data Factory reads a JSON configuration file that contains the list of source files and target paths.
How it works:
- ADF reads a JSON config using Lookup Activity (returns an array of file metadata).
- A ForEach Activity loops through each JSON item.
- Inside the loop, a Copy Activity pulls data from the GitHub HTTP link and lands it in ADLS Gen2 (Bronze) using dynamic folder + filename.
JSON config fields:
- `re_url` → GitHub raw URL for the source CSV file
- `p_directory` → destination folder name in ADLS (Bronze layer)
- `p_filename` → destination file name in ADLS
Dynamic behavior (high-level):
- Source URL driven by `@item().re_url`
- Sink folder driven by `@item().p_directory`
- Sink filename driven by `@item().p_filename`
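A minimal entry in the JSON config consumed by the Lookup activity could look like the following sketch (the repository path and file names are illustrative placeholders, not the project's actual values):

```json
[
  {
    "re_url": "https://raw.githubusercontent.com/<owner>/<repo>/main/Data/AdventureWorks_Products.csv",
    "p_directory": "products",
    "p_filename": "AdventureWorks_Products.csv"
  }
]
```

The Lookup activity returns this array, and each object becomes one `@item()` inside the ForEach loop.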
Key ADF components used:
- Lookup (JSON config)
- ForEach (iterate config array)
- Copy Activity (HTTP → ADLS)
- Parameterized datasets (URL / folder / file name)
Why this design?
- Adding a new dataset requires only updating the JSON file—no pipeline redesign.
Silver layer (Databricks) steps:
- Read Bronze CSVs from ADLS
- Apply transformations/standardization
- Write outputs to ADLS Silver as Parquet (Snappy)
- Calendar: derived `Month` and `Year` from `Date`
- Products: cleaned `ProductSKU` (substring before '-') and simplified `ProductName` (first word)
- Sales: converted `StockDate` to timestamp, standardized `OrderNumber` (S → T), and created `TotalQuantity = OrderLineItem × OrderQuantity`
- Converted CSV → Parquet (Snappy) for optimized Serverless querying
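The Products and Sales column rules above reduce to simple row-level logic. A plain-Python sketch of that logic follows (the actual notebook would express these as PySpark column expressions such as `split`/`regexp_replace`; function names here are illustrative, and the S → T rule is assumed to apply to the leading character of the order number):

```python
def clean_sku(sku: str) -> str:
    """Keep only the SKU prefix before the first '-' (e.g. 'BK-M68B-38' -> 'BK')."""
    return sku.split("-")[0]

def simplify_name(product_name: str) -> str:
    """Keep only the first word of the product name."""
    return product_name.split(" ")[0]

def standardize_order_number(order_number: str) -> str:
    """Replace the leading 'S' with 'T' (e.g. 'SO12345' -> 'TO12345')."""
    return order_number.replace("S", "T", 1)

def total_quantity(order_line_item: int, order_quantity: int) -> int:
    """TotalQuantity = OrderLineItem * OrderQuantity."""
    return order_line_item * order_quantity
```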
- Created schema `gold`
- Built Gold views on Silver Parquet using `OPENROWSET`
Gold views created:
`gold.calendar`, `gold.sales`, `gold.territories`, `gold.customers`, `gold.product_subcategories`, `gold.products`, `gold.returns`
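As a sketch, one of these views could be defined as follows (the storage account name and folder path are placeholders, not the project's actual values):

```sql
CREATE VIEW gold.sales AS
SELECT *
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/silver/sales/',
    FORMAT = 'PARQUET'
) AS sales;
```

Because Serverless SQL reads the Parquet files in place, no data is copied into a dedicated pool; the Gold layer is purely a curated view layer over Silver.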
- Connected Power BI to Synapse Serverless Gold views
- Built visuals and KPIs (example dashboards)
- Published final report/dashboard
- Orders by Year (2015–2017)
- Product Cost distribution by Product Name
- (Optional) Sales by Territory / Returns analysis
- Update `config/ingestion_config.json` with GitHub raw URLs (`re_url`) and desired output paths (`p_directory`, `p_filename`).
- Run the ADF pipeline (Lookup → ForEach → Copy) to load Bronze into ADLS.
- Run the Databricks notebook to transform Bronze → Silver (Parquet/Snappy).
- Execute Synapse SQL scripts to create the `gold` schema and views.
- Open the Power BI report and refresh using Gold views.