
Entity Match Platform

A production-ready data pipeline for entity matching, web data enrichment, and analytics visualization using Apache Airflow, PySpark, and Docker.

🎯 Overview

This platform enables organizations to:

  • Match entities across different data sources using fuzzy matching algorithms
  • Enrich data by scraping external sources for additional information
  • Visualize results through an interactive analytics dashboard
  • Track matching quality with comprehensive metrics and tracing

Perfect for:

  • Customer data deduplication
  • Third-party data integration
  • Entity resolution across databases
  • Data quality improvement initiatives
  • Master data management (MDM)

πŸ—οΈ Architecture

┌─────────────────┐
│   Source A      │ (Internal Database)
│  (Primary Data) │
└────────┬────────┘
         │
         ├──────────► ┌─────────────────────┐
         │            │  Data Preparation   │
         │            │  - Normalization    │
┌────────▼────────┐   │  - Deduplication    │
│   Source B      │──►│  - Web Enrichment   │
│ (External Data) │   └──────────┬──────────┘
└─────────────────┘              │
                                 │
                     ┌───────────▼──────────┐
                     │   Entity Matching    │
                     │  - Exact Matching    │
                     │  - LSH Fuzzy Match   │
                     │  - Multi-stage       │
                     └───────────┬──────────┘
                                 │
                     ┌───────────▼──────────┐
                     │    Visualization     │
                     │  - Match Analytics   │
                     │  - Quality Metrics   │
                     └──────────────────────┘

✨ Features

1. Multi-Stage Matching Engine

  • Exact Matching: Fast matching on unique identifiers (tax IDs, registration numbers)
  • Fuzzy Matching: LSH-based similarity matching for name fields
  • Weighted Scoring: Configurable weights for different field importance
  • Deduplication: Automatic removal of duplicate matches
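
A rough sketch of how weighted scoring can combine per-field similarities into a single score. The helper below is illustrative and uses a plain string ratio; the actual engine works on LSH candidates in src/matching/matcher.py:

from difflib import SequenceMatcher

# Illustrative weights mirroring the MatchingConfig shown in the Configuration section
WEIGHTS = {"name": 0.45, "registration_id": 0.25, "tax_id": 0.25, "other": 0.05}

def field_similarity(a: str, b: str) -> float:
    # Simple string similarity in [0, 1]; a stand-in for the real per-field comparison
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def weighted_score(record_a: dict, record_b: dict) -> float:
    # Combine per-field similarities using the configured weights
    return sum(
        weight * field_similarity(record_a.get(field, ""), record_b.get(field, ""))
        for field, weight in WEIGHTS.items()
    )

A candidate pair is kept as a match when its score clears similarity_threshold (0.65 by default).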

2. Web Data Enrichment

  • Configurable web scraping for external data sources
  • Configurable search strategies (by tax ID, registration number, or name)
  • Automatic data validation and cleaning
  • Rate limiting and error handling
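
A minimal sketch of the rate-limiting and retry idea, assuming a plain requests-based fetcher; the timeout, delay, and retry counts are illustrative, not the scraper's actual settings:

from typing import Optional
import time
import requests

def fetch_with_retries(url: str, params: dict, retries: int = 3, delay_s: float = 1.0) -> Optional[str]:
    # Fetch a page politely: pause between attempts and give up after a few failures
    for attempt in range(retries):
        try:
            response = requests.get(url, params=params, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(delay_s * (attempt + 1))  # simple linear backoff
    return None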

3. Production-Ready Pipeline

  • Docker containerization for consistent deployment
  • Apache Airflow for workflow orchestration
  • PySpark for distributed data processing
  • Incremental processing to avoid reprocessing
  • Tracing & auditing of all matching attempts
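
For orientation, an Airflow DAG wiring the stages together could look roughly like the skeleton below. The real DAG lives in airflow/dags/entity_matching_pipeline.py; the task names, callables, and schedule here are assumptions:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; the real pipeline calls into src/matching and src/enrichment
def prepare_data(**context): ...
def match_entities(**context): ...
def export_results(**context): ...

with DAG(
    dag_id="entity_matching_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    prepare = PythonOperator(task_id="prepare_data", python_callable=prepare_data)
    match = PythonOperator(task_id="match_entities", python_callable=match_entities)
    export = PythonOperator(task_id="export_results", python_callable=export_results)

    prepare >> match >> export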

4. Interactive Dashboard

  • Real-time matching statistics
  • Data quality metrics
  • Field-level mismatch analysis
  • Tracing analytics with attempt history

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose
  • 8GB+ RAM recommended
  • Python 3.9+

Installation

  1. Clone the repository

git clone https://github.com/Chaimaaorg/entity-match-platform.git
cd entity-match-platform

  2. Configure the environment

cp .env.example .env
# Edit .env with your settings

  3. Update the configuration in config/config.ini:

[dev]
db = local_db
source_a_main = /app/data/source_a/entities.csv
source_b_main = /app/data/source_b/entities.csv

  4. Start the platform

docker-compose up -d

  5. Access the services (a quick sanity check is shown below)

  • Airflow UI: http://localhost:8080 (user: admin, password: admin)
  • Dashboard: open visualization/analytics_dashboard.html in a browser
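
To confirm the containers came up before moving on (standard Docker Compose commands):

docker-compose ps        # all services should report "Up"
docker-compose logs -f   # follow the logs if a service fails to start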

Running the Pipeline

  1. Prepare your data

    • Place Source A data in data/source_a/
    • Place Source B data in data/source_b/
    • Ensure CSV files have proper headers
  2. Trigger the DAG

    • Open Airflow UI
    • Enable the entity_matching_pipeline DAG
    • Trigger manually or wait for scheduled run
  3. View results

    • Matched records: data/matched/results.csv
    • Processing logs: Check Airflow task logs
    • Analytics: Load CSV files into the dashboard
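
As an alternative to the UI in step 2, the DAG can be unpaused and triggered from the command line inside the Airflow container. The service name airflow-webserver below is an assumption; check docker-compose.yml for the actual name:

docker-compose exec airflow-webserver airflow dags unpause entity_matching_pipeline
docker-compose exec airflow-webserver airflow dags trigger entity_matching_pipeline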

📊 Data Format

Source A (Primary Dataset)

entity_id,name,tax_id,registration_number,city
001,Acme Corp,TX123456,REG789,New York
002,Beta LLC,TX234567,REG890,Los Angeles

Source B (Secondary Dataset)

entity_id,name,tax_id,registration_number,activity
B001,Acme Corporation,TX123456,REG789,Manufacturing
B002,Beta Limited,TX234567,REG890,Retail

Required Fields

  • Unique identifier (entity_id, company_id, etc.)
  • Name field (company name, organization name, etc.)
  • Optional identifiers (tax_id, registration_number, etc.), used for exact matching when present
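
For reference, loading both sources with PySpark using the paths from config/config.ini might look like this minimal sketch (the SparkSession setup is illustrative, not the pipeline's actual bootstrap code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("entity-matching").getOrCreate()

# Paths as configured in config/config.ini
source_a = spark.read.csv("/app/data/source_a/entities.csv", header=True, inferSchema=True)
source_b = spark.read.csv("/app/data/source_b/entities.csv", header=True, inferSchema=True)

source_a.printSchema()  # expect entity_id, name, tax_id, registration_number, city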

βš™οΈ Configuration

Matching Parameters (src/matching/matcher.py)

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class MatchingConfig:
    lsh_num_hash_tables: int = 3
    lsh_distance_threshold: float = 0.5
    similarity_threshold: float = 0.65
    max_candidates_per_record: int = 10

    # Mutable defaults must be wrapped in default_factory for a dataclass field
    weights: Dict[str, float] = field(default_factory=lambda: {
        "name": 0.45,
        "registration_id": 0.25,
        "tax_id": 0.25,
        "other": 0.05,
    })
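
Thresholds and weights can be overridden per run; for example, a stricter configuration (the values are illustrative):

config = MatchingConfig(
    similarity_threshold=0.75,    # require closer matches
    max_candidates_per_record=5,  # keep fewer LSH candidates per record
    weights={"name": 0.6, "tax_id": 0.2, "registration_id": 0.15, "other": 0.05},
)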

Web Enrichment (src/enrichment/web_scraper.py)

# Import path follows the project layout (src/enrichment/web_scraper.py)
from src.enrichment.web_scraper import EntityEnricher

enricher = EntityEnricher(
    base_url="https://your-data-source.com/search",
    search_params_mapping={
        "tax_id": "tax_param",
        "registration_id": "reg_param",
        "name": "name_param"
    }
)
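
Conceptually, search_params_mapping translates an entity's fields into the query parameters of the external search endpoint. A hedged illustration of that idea (not the EntityEnricher internals):

# Sample record and the mapping from the example above
entity = {"name": "Acme Corp", "tax_id": "TX123456"}
mapping = {"tax_id": "tax_param", "name": "name_param"}

# Build query parameters from the fields the entity actually has
params = {mapping[field]: value for field, value in entity.items() if field in mapping}
# -> {"tax_param": "TX123456", "name_param": "Acme Corp"}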

πŸ“ Project Structure

entity-match-platform/
├── airflow/                 # Airflow DAGs and configuration
│   ├── dags/
│   │   └── entity_matching_pipeline.py
│   └── Dockerfile
├── config/                  # Configuration files
│   └── config.ini
├── src/                     # Source code
│   ├── enrichment/          # Web scraping modules
│   │   └── web_scraper.py
│   └── matching/            # Matching engine
│       ├── data_preparation.py
│       ├── matcher.py
│       └── utils.py
├── data/                    # Data storage
│   ├── source_a/            # Primary dataset
│   ├── source_b/            # Secondary dataset
│   ├── processed/           # Cleaned data
│   └── matched/             # Matching results
├── visualization/           # Analytics dashboard
│   └── analytics_dashboard.html
└── tests/                   # Unit tests

🔧 Customization

Adding New Data Sources

  1. Update config/config.ini with new paths
  2. Create data loaders in src/matching/data_preparation.py
  3. Adjust field mappings in matching configuration

Custom Matching Logic

Override methods in src/matching/matcher.py:

class CustomMatcher(LSHMatcher):
    def _build_feature_pipeline(self, company_col: str):
        # Custom text processing logic
        pass
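
As one possible override, assuming the matcher expects an LSH-ready feature column, a character n-gram pipeline built with PySpark ML could look like the sketch below. The column names, stages, and return value are assumptions, not the base class implementation:

from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH

# LSHMatcher is the project's base class from src/matching/matcher.py
class CustomMatcher(LSHMatcher):
    def _build_feature_pipeline(self, company_col: str):
        # Split names into characters, build 2-grams, hash them, then MinHash for LSH
        tokenizer = RegexTokenizer(inputCol=company_col, outputCol="chars", pattern=".", gaps=False)
        ngrams = NGram(n=2, inputCol="chars", outputCol="ngrams")
        hashing = HashingTF(inputCol="ngrams", outputCol="features", numFeatures=1 << 14)
        lsh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=3)
        return Pipeline(stages=[tokenizer, ngrams, hashing, lsh])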

Web Scraping Configuration

Implement scrape_entities_from_html() for your target website:

from bs4 import BeautifulSoup

def scrape_entities_from_html(self, html: str):
    # Parse HTML specific to your data source
    soup = BeautifulSoup(html, "html.parser")
    # Extract entity information into a list of records
    # (selectors depend on the target website's markup)
    entities = []
    return entities

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

📄 License

MIT License - see LICENSE file for details
