End-to-end automated system for discovering, classifying, and enriching nautical business leads across France — powered by the Google Places API, Scikit-Learn, and SQLite.
This pipeline transforms a manual, time-consuming prospecting process into an automated industrial workflow. Starting from a geographic grid query against the Google Places API, it filters semantic noise with a multi-label ML classifier, enriches confirmed leads with contact information, and delivers a clean, normalised SQLite database ready for outreach campaigns.
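The grid query mentioned above amounts to tiling a region's bounding box into points, each of which becomes the centre of one Places request. A minimal sketch — the Brittany-like bounds and the 0.2-degree step are illustrative, not the project's actual values, and the real pipeline additionally drops inland points with the shapely coastal buffer before any API call:

```python
# Illustrative grid generator: tile a lat/lng bounding box into points.
# Bounds and step size are assumptions, not the project's real config.
def make_grid(lat_min, lat_max, lng_min, lng_max, step=0.2):
    points = []
    lat = lat_min
    while lat <= lat_max:
        lng = lng_min
        while lng <= lng_max:
            # Each point would become one Places query centre;
            # the coastal buffer filter would prune inland points here.
            points.append((round(lat, 4), round(lng, 4)))
            lng += step
        lat += step
    return points

grid = make_grid(47.2, 48.9, -5.2, -1.0)
```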
Key results on the current dataset (1,235 leads):
- Classifier Hamming Loss: 0.046
- Exact-match accuracy (12 labels simultaneously): 64.3%
- Accuracy within 1-label tolerance: >90%
- API cost savings via coastal buffer filter: up to 90%
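Both headline metrics correspond to scikit-learn built-ins. A minimal sketch on a toy 12-column label matrix (illustrative only — the real evaluation uses the classifier's predictions on the held-out set):

```python
import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score

# Toy ground truth / predictions over 12 binary activity tags.
y_true = np.array([[1, 0, 1] + [0] * 9,
                   [0, 1, 0] + [0] * 9])
y_pred = np.array([[1, 0, 0] + [0] * 9,
                   [0, 1, 0] + [0] * 9])

# Hamming loss: fraction of individual label slots that are wrong.
print(hamming_loss(y_true, y_pred))    # 1 wrong slot out of 24 -> ~0.042

# Exact match: a lead counts only if all 12 labels agree simultaneously.
print(accuracy_score(y_true, y_pred))  # 1 of 2 rows fully correct -> 0.5
```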
- Geospatial Collection — Grid-based scraping with configurable region bounds and a shapely coastal buffer filter that eliminates inland grid points before any API call
- Website Text Harvesting — Lightweight HTTP crawler captures public page content during collection, ready for the ML step
- Multi-Label ML Filtering — TF-IDF + logistic regression assigns 12 nautical activity tags (`is_trad`, `is_surf`, `is_plongee`, …) and drops non-nautical businesses
- Three-Layer Deduplication — `place_id`, normalised website URL, and `name + city` composite key
- Contact Enrichment — Email and phone extraction runs only on ML-confirmed leads, saving time; a random wait between requests avoids rate limits
- SQL Synchronisation — One command syncs the master CSV to a normalised SQLite schema (`leads`, `leads_content`, `master_view`)
- Public Snapshot — Anonymised export (contacts replaced with `FOUND`/`NOT FOUND`) for safe GitHub hosting
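The multi-label filter in the list above can be sketched as a standard scikit-learn pipeline. The corpus, tag columns, and vectoriser settings below are illustrative assumptions — the project trains on the harvested `scraped_text` column across all 12 tags:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus of scraped website text.
texts = [
    "location de voiliers traditionnels et sorties en mer",
    "cours de surf et location de planches sur la plage",
    "restaurant gastronomique vue mer",
]
labels = [[1, 0], [0, 1], [0, 0]]  # columns: is_trad, is_surf

clf = make_pipeline(
    TfidfVectorizer(),                           # sparse TF-IDF features
    MultiOutputClassifier(LogisticRegression()),  # one binary model per tag
)
clf.fit(texts, labels)
print(clf.predict(["ecole de surf pour debutants"]))  # one row, one 0/1 per tag
```

`MultiOutputClassifier` is just one idiomatic way to get an independent binary logistic model per tag; `OneVsRestClassifier` would work equally well here.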
```bash
pip install -r requirements.txt
```

Copy `.env.example` to `.env` and set your credentials:
| Variable | Description |
|---|---|
| `GOOGLE_MAPS_API_KEY` | Google Places API key |
| `GMAIL_USER` | Gmail sender address |
| `GMAIL_APP_PASSWORD` | Gmail app password |
| `SENDER_NAME` | Display name for outreach emails |
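At start-up these variables need to be read from the environment. A minimal fail-fast sketch using only the stdlib — the `load_credentials` helper is hypothetical, not part of the project's API:

```python
import os

# Variable names match the table above; the helper itself is illustrative.
REQUIRED = ["GOOGLE_MAPS_API_KEY", "GMAIL_USER",
            "GMAIL_APP_PASSWORD", "SENDER_NAME"]

def load_credentials():
    """Return all required credentials, raising early if any is unset."""
    missing = [name for name in REQUIRED if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing env vars: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED}
```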
```bash
# 1. Collect raw leads from Google Places (includes website text scraping)
python cli.py collect --region bretagne

# 2. ML filter: assign 12 activity tags, keep is_target == 1 leads only
python cli.py predict --region bretagne

# 3. Enrich confirmed leads with email + phone
python cli.py enrich --region bretagne

# 4. Standardise schema, generate unique keys, upsert into master dataset
python cli.py consolidate --region bretagne

# 5. Sync master CSV to SQLite database
python cli.py sync-db

# 6. Export anonymised public snapshot
python cli.py snapshot

# --- Utilities ---
python cli.py status          # Data-quality dashboard
python cli.py train           # Retrain the ML classifier
python cli.py mail --dry-run  # Preview outreach campaign
python cli.py mail            # Send outreach campaign
```

Google Places API
```
        │
        ▼
data/{region}_leads.csv          ← collect (Places metadata + scraped_text)
        │
        ▼
data/{region}_predicted.csv      ← predict (ML filter — is_target == 1 only)
        │
        ▼
data/{region}_enriched.csv       ← enrich (email + phone)
        │
        ▼
data/_locked/final_for_sql.csv   ← consolidate (standardise + unique_key + upsert)
        │                  │
        ▼                  ▼
nautical_leads.db    public_nautical_data.csv
        ↑                  ↑
     sync-db            snapshot → Dashboard / GitHub
```
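The `sync-db` step boils down to an idempotent upsert from the master CSV into SQLite. A minimal stdlib sketch — the column list here is a simplified assumption based on the schema names above, not the project's full `leads` table:

```python
import csv
import sqlite3

def sync_csv_to_sqlite(csv_path: str, db_path: str) -> None:
    """Upsert rows from the master CSV into the leads table, keyed on
    unique_key (INSERT OR REPLACE keeps repeated syncs idempotent)."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS leads (
                       unique_key TEXT PRIMARY KEY,
                       name TEXT, city TEXT, website TEXT)""")
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            con.execute(
                "INSERT OR REPLACE INTO leads VALUES (?, ?, ?, ?)",
                (row["unique_key"], row["name"], row["city"], row["website"]),
            )
    con.commit()
    con.close()
```

Because the primary key is the consolidation step's `unique_key`, rerunning the sync after a new `consolidate` updates existing leads in place instead of duplicating them.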
To launch the Streamlit dashboard:

```bash
streamlit run ui/app.py
```

For in-depth details on the ML architecture, labelling strategy, vectorisation pipeline, and performance analysis:
