
Entity Match Platform

A production-ready data pipeline for entity matching, web data enrichment, and analytics visualization using Apache Airflow, PySpark, and Docker.

🎯 Overview

This platform enables organizations to:

  • Match entities across different data sources using fuzzy matching algorithms
  • Enrich data by scraping external sources for additional information
  • Visualize results through an interactive analytics dashboard
  • Track matching quality with comprehensive metrics and tracing

Perfect for:

  • Customer data deduplication
  • Third-party data integration
  • Entity resolution across databases
  • Data quality improvement initiatives
  • Master data management (MDM)

πŸ—οΈ Architecture

┌─────────────────┐
│   Source A      │ (Internal Database)
│  (Primary Data) │
└────────┬────────┘
         │
         ├──────────► ┌─────────────────────┐
         │            │  Data Preparation   │
         │            │  - Normalization    │
┌────────▼────────┐   │  - Deduplication    │
│   Source B      │──►│  - Web Enrichment   │
│ (External Data) │   └──────────┬──────────┘
└─────────────────┘              │
                                 │
                     ┌───────────▼──────────┐
                     │   Entity Matching    │
                     │  - Exact Matching    │
                     │  - LSH Fuzzy Match   │
                     │  - Multi-stage       │
                     └───────────┬──────────┘
                                 │
                     ┌───────────▼──────────┐
                     │    Visualization     │
                     │  - Match Analytics   │
                     │  - Quality Metrics   │
                     └──────────────────────┘

✨ Features

1. Multi-Stage Matching Engine

  • Exact Matching: Fast matching on unique identifiers (tax IDs, registration numbers)
  • Fuzzy Matching: LSH-based similarity matching for name fields
  • Weighted Scoring: Configurable weights for different field importance
  • Deduplication: Automatic removal of duplicate matches
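
A rough sketch of how weighted scoring can combine per-field similarities into a single score. The helper below is illustrative and uses a plain string ratio; the actual engine works on LSH candidates in src/matching/matcher.py:

from difflib import SequenceMatcher

# Illustrative weights mirroring the MatchingConfig shown in the Configuration section
WEIGHTS = {"name": 0.45, "registration_id": 0.25, "tax_id": 0.25, "other": 0.05}

def field_similarity(a: str, b: str) -> float:
    # Simple string similarity in [0, 1]; a stand-in for the real per-field comparison
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def weighted_score(record_a: dict, record_b: dict) -> float:
    # Combine per-field similarities using the configured weights
    return sum(
        weight * field_similarity(record_a.get(field, ""), record_b.get(field, ""))
        for field, weight in WEIGHTS.items()
    )

A candidate pair is kept as a match when its score clears similarity_threshold (0.65 by default).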

2. Web Data Enrichment

  • Configurable web scraping for external data sources
  • Configurable search strategies (by tax ID, registration number, or name)
  • Automatic data validation and cleaning
  • Rate limiting and error handling
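
A minimal sketch of the rate-limiting and retry idea, assuming a plain requests-based fetcher; the timeout, delay, and retry counts are illustrative, not the scraper's actual settings:

from typing import Optional
import time
import requests

def fetch_with_retries(url: str, params: dict, retries: int = 3, delay_s: float = 1.0) -> Optional[str]:
    # Fetch a page politely: pause between attempts and give up after a few failures
    for attempt in range(retries):
        try:
            response = requests.get(url, params=params, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(delay_s * (attempt + 1))  # simple linear backoff
    return None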

3. Production-Ready Pipeline

  • Docker containerization for consistent deployment
  • Apache Airflow for workflow orchestration
  • PySpark for distributed data processing
  • Incremental processing to avoid reprocessing
  • Tracing & auditing of all matching attempts
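
For orientation, an Airflow DAG wiring the stages together could look roughly like the skeleton below. The real DAG lives in airflow/dags/entity_matching_pipeline.py; the task names, callables, and schedule here are assumptions:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; the real pipeline calls into src/matching and src/enrichment
def prepare_data(**context): ...
def match_entities(**context): ...
def export_results(**context): ...

with DAG(
    dag_id="entity_matching_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    prepare = PythonOperator(task_id="prepare_data", python_callable=prepare_data)
    match = PythonOperator(task_id="match_entities", python_callable=match_entities)
    export = PythonOperator(task_id="export_results", python_callable=export_results)

    prepare >> match >> export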

4. Interactive Dashboard

  • Real-time matching statistics
  • Data quality metrics
  • Field-level mismatch analysis
  • Tracing analytics with attempt history

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose
  • 8GB+ RAM recommended
  • Python 3.9+

Installation

  1. Clone the repository

git clone https://github.com/Chaimaaorg/entity-match-platform.git
cd entity-match-platform

  2. Configure the environment

cp .env.example .env
# Edit .env with your settings

  3. Update the configuration in config/config.ini:

[dev]
db = local_db
source_a_main = /app/data/source_a/entities.csv
source_b_main = /app/data/source_b/entities.csv

  4. Start the platform

docker-compose up -d

  5. Access the services (a quick sanity check is shown below)

  • Airflow UI: http://localhost:8080 (user: admin, password: admin)
  • Dashboard: open visualization/analytics_dashboard.html in a browser
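
To confirm the containers came up before moving on (standard Docker Compose commands):

docker-compose ps        # all services should report "Up"
docker-compose logs -f   # follow the logs if a service fails to start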

Running the Pipeline

  1. Prepare your data

    • Place Source A data in data/source_a/
    • Place Source B data in data/source_b/
    • Ensure CSV files have proper headers
  2. Trigger the DAG

    • Open Airflow UI
    • Enable the entity_matching_pipeline DAG
    • Trigger manually or wait for scheduled run
  3. View results

    • Matched records: data/matched/results.csv
    • Processing logs: Check Airflow task logs
    • Analytics: Load CSV files into the dashboard
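
As an alternative to the UI in step 2, the DAG can be unpaused and triggered from the command line inside the Airflow container. The service name airflow-webserver below is an assumption; check docker-compose.yml for the actual name:

docker-compose exec airflow-webserver airflow dags unpause entity_matching_pipeline
docker-compose exec airflow-webserver airflow dags trigger entity_matching_pipeline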

📊 Data Format

Source A (Primary Dataset)

entity_id,name,tax_id,registration_number,city
001,Acme Corp,TX123456,REG789,New York
002,Beta LLC,TX234567,REG890,Los Angeles

Source B (Secondary Dataset)

entity_id,name,tax_id,registration_number,activity
B001,Acme Corporation,TX123456,REG789,Manufacturing
B002,Beta Limited,TX234567,REG890,Retail

Required Fields

  • Unique identifier (entity_id, company_id, etc.)
  • Name field (company name, organization name, etc.)
  • Optional identifiers (tax_id, registration_number, etc.), used for exact matching when present
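
For reference, loading both sources with PySpark using the paths from config/config.ini might look like this minimal sketch (the SparkSession setup is illustrative, not the pipeline's actual bootstrap code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("entity-matching").getOrCreate()

# Paths as configured in config/config.ini
source_a = spark.read.csv("/app/data/source_a/entities.csv", header=True, inferSchema=True)
source_b = spark.read.csv("/app/data/source_b/entities.csv", header=True, inferSchema=True)

source_a.printSchema()  # expect entity_id, name, tax_id, registration_number, city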

βš™οΈ Configuration

Matching Parameters (src/matching/matcher.py)

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class MatchingConfig:
    lsh_num_hash_tables: int = 3
    lsh_distance_threshold: float = 0.5
    similarity_threshold: float = 0.65
    max_candidates_per_record: int = 10

    # Mutable defaults must be wrapped in default_factory for a dataclass field
    weights: Dict[str, float] = field(default_factory=lambda: {
        "name": 0.45,
        "registration_id": 0.25,
        "tax_id": 0.25,
        "other": 0.05,
    })
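
Thresholds and weights can be overridden per run; for example, a stricter configuration (the values are illustrative):

config = MatchingConfig(
    similarity_threshold=0.75,    # require closer matches
    max_candidates_per_record=5,  # keep fewer LSH candidates per record
    weights={"name": 0.6, "tax_id": 0.2, "registration_id": 0.15, "other": 0.05},
)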

Web Enrichment (src/enrichment/web_scraper.py)

# Import path follows the project layout (src/enrichment/web_scraper.py)
from src.enrichment.web_scraper import EntityEnricher

enricher = EntityEnricher(
    base_url="https://your-data-source.com/search",
    search_params_mapping={
        "tax_id": "tax_param",
        "registration_id": "reg_param",
        "name": "name_param"
    }
)
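
Conceptually, search_params_mapping translates an entity's fields into the query parameters of the external search endpoint. A hedged illustration of that idea (not the EntityEnricher internals):

# Sample record and the mapping from the example above
entity = {"name": "Acme Corp", "tax_id": "TX123456"}
mapping = {"tax_id": "tax_param", "name": "name_param"}

# Build query parameters from the fields the entity actually has
params = {mapping[field]: value for field, value in entity.items() if field in mapping}
# -> {"tax_param": "TX123456", "name_param": "Acme Corp"}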

πŸ“ Project Structure

entity-match-platform/
├── airflow/                 # Airflow DAGs and configuration
│   ├── dags/
│   │   └── entity_matching_pipeline.py
│   └── Dockerfile
├── config/                  # Configuration files
│   └── config.ini
├── src/                     # Source code
│   ├── enrichment/          # Web scraping modules
│   │   └── web_scraper.py
│   └── matching/            # Matching engine
│       ├── data_preparation.py
│       ├── matcher.py
│       └── utils.py
├── data/                    # Data storage
│   ├── source_a/            # Primary dataset
│   ├── source_b/            # Secondary dataset
│   ├── processed/           # Cleaned data
│   └── matched/             # Matching results
├── visualization/           # Analytics dashboard
│   └── analytics_dashboard.html
└── tests/                   # Unit tests

🔧 Customization

Adding New Data Sources

  1. Update config/config.ini with new paths
  2. Create data loaders in src/matching/data_preparation.py
  3. Adjust field mappings in matching configuration

Custom Matching Logic

Override methods in src/matching/matcher.py:

class CustomMatcher(LSHMatcher):
    def _build_feature_pipeline(self, company_col: str):
        # Custom text processing logic
        pass
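
As one possible override, assuming the matcher expects an LSH-ready feature column, a character n-gram pipeline built with PySpark ML could look like the sketch below. The column names, stages, and return value are assumptions, not the base class implementation:

from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH

# LSHMatcher is the project's base class from src/matching/matcher.py
class CustomMatcher(LSHMatcher):
    def _build_feature_pipeline(self, company_col: str):
        # Split names into characters, build 2-grams, hash them, then MinHash for LSH
        tokenizer = RegexTokenizer(inputCol=company_col, outputCol="chars", pattern=".", gaps=False)
        ngrams = NGram(n=2, inputCol="chars", outputCol="ngrams")
        hashing = HashingTF(inputCol="ngrams", outputCol="features", numFeatures=1 << 14)
        lsh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=3)
        return Pipeline(stages=[tokenizer, ngrams, hashing, lsh])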

Web Scraping Configuration

Implement scrape_entities_from_html() for your target website:

from bs4 import BeautifulSoup

def scrape_entities_from_html(self, html: str):
    # Parse HTML specific to your data source
    soup = BeautifulSoup(html, "html.parser")
    # Extract entity information into a list of records
    # (selectors depend on the target website's markup)
    entities = []
    return entities

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

📄 License

MIT License - see LICENSE file for details
