MorganRff/nautic-optimizer


Nautical Lead Intelligence Pipeline

End-to-end automated system for discovering, classifying, and enriching nautical business leads across France — powered by the Google Places API, Scikit-Learn, and SQLite.


Overview

This pipeline transforms a manual, time-consuming prospecting process into an automated industrial workflow. Starting from a geographic grid query against the Google Places API, it filters semantic noise with a multi-label ML classifier, enriches confirmed leads with contact information, and delivers a clean, normalised SQLite database ready for outreach campaigns.

Key results on the current dataset (1,235 leads):

  • Classifier Hamming Loss: 0.046
  • Exact-match accuracy (12 labels simultaneously): 64.3%
  • Accuracy within 1-label tolerance: >90%
  • API cost savings via coastal buffer filter: up to 90%
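These metrics follow the standard scikit-learn definitions. A toy illustration with 3 tags instead of the pipeline's 12 (the arrays below are made up for demonstration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Toy ground truth and predictions for 4 leads x 3 activity tags
# (the real classifier uses 12 tags such as is_trad, is_surf, is_plongee).
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 1],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],   # one wrong tag cell
                   [0, 0, 1],
                   [1, 1, 0]])

# Hamming loss: fraction of individual tag cells that are wrong (1 of 12 here).
print(hamming_loss(y_true, y_pred))    # ≈ 0.0833

# Exact-match accuracy: a row counts only if every tag is correct (3 of 4 rows).
print(accuracy_score(y_true, y_pred))  # 0.75
```

Exact-match (subset) accuracy is the stricter metric: a single wrong tag out of 12 fails the whole row, which is why the 1-label-tolerance figure is reported separately.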

Core Features

  • Geospatial Collection — Grid-based scraping with configurable region bounds and a shapely coastal buffer filter that eliminates inland grid points before any API call
  • Website Text Harvesting — Lightweight HTTP crawler captures public page content during collection, ready for the ML step
  • Multi-Label ML Filtering — TF-IDF + logistic regression assigns 12 nautical activity tags (is_trad, is_surf, is_plongee, …) and drops non-nautical businesses
  • Three-Layer Deduplication — place_id, normalised website URL, and name + city composite key
  • Contact Enrichment — Email and phone extraction runs only on ML-confirmed leads, saving crawl time; randomised waits between requests avoid rate limiting
  • SQL Synchronisation — One command syncs the master CSV to a normalised SQLite schema (leads, leads_content, master_view)
  • Public Snapshot — Anonymised export (contacts replaced with FOUND / NOT FOUND) for safe GitHub hosting
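The coastal buffer filter can be sketched with shapely. The coastline segment, buffer radius, and grid resolution below are illustrative assumptions, not the project's actual geometry:

```python
from itertools import product
from shapely.geometry import LineString, Point

# Hypothetical simplified coastline segment in (lon, lat); the real pipeline
# would load the actual region geometry.
coastline = LineString([(-4.5, 48.0), (-3.0, 48.7), (-1.5, 48.6)])

# Buffer the coastline by ~0.2 degrees (roughly 20 km) on each side.
coastal_zone = coastline.buffer(0.2)

# Build a coarse lon/lat grid over the region's bounding box
# (values scaled by 10 to keep the ranges integral).
grid = [Point(lon / 10, lat / 10)
        for lon, lat in product(range(-50, -9, 2), range(475, 492, 2))]

# Keep only grid points inside the buffer — inland points never reach the API.
coastal_points = [p for p in grid if coastal_zone.contains(p)]
print(f"{len(coastal_points)}/{len(grid)} grid points kept")
```

Because each surviving grid point triggers one or more Places API calls, discarding inland points before querying is where the claimed cost savings come from.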

Quick Start

Prerequisites

pip install -r requirements.txt

Copy .env.example to .env and set your credentials:

| Variable              | Description                      |
|-----------------------|----------------------------------|
| `GOOGLE_MAPS_API_KEY` | Google Places API key            |
| `GMAIL_USER`          | Gmail sender address             |
| `GMAIL_APP_PASSWORD`  | Gmail app password               |
| `SENDER_NAME`         | Display name for outreach emails |
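A minimal sketch of validating these settings at startup, assuming they have been loaded into the environment (for instance via python-dotenv). The `load_settings` helper is hypothetical, not part of the project's CLI:

```python
import os

# The four variables documented in .env.example.
REQUIRED = ["GOOGLE_MAPS_API_KEY", "GMAIL_USER", "GMAIL_APP_PASSWORD", "SENDER_NAME"]

def load_settings(env=os.environ):
    """Return the required settings, failing fast if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing credentials: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED}

# Example with a fake environment dict:
fake_env = {name: "dummy" for name in REQUIRED}
print(load_settings(fake_env)["GMAIL_USER"])  # dummy
```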

Full Pipeline

# 1. Collect raw leads from Google Places (includes website text scraping)
python cli.py collect --region bretagne

# 2. ML filter: assign 12 activity tags, keep is_target == 1 leads only
python cli.py predict --region bretagne

# 3. Enrich confirmed leads with email + phone
python cli.py enrich --region bretagne

# 4. Standardise schema, generate unique keys, upsert into master dataset
python cli.py consolidate --region bretagne

# 5. Sync master CSV to SQLite database
python cli.py sync-db

# 6. Export anonymised public snapshot
python cli.py snapshot

# --- Utilities ---
python cli.py status          # Data-quality dashboard
python cli.py train           # Retrain the ML classifier
python cli.py mail --dry-run  # Preview outreach campaign
python cli.py mail            # Send outreach campaign
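The ML filter in step 2 is described above as TF-IDF + logistic regression over 12 activity tags. A minimal sketch of that architecture on toy data (the snippets, two-tag label set, and hyperparameters are illustrative assumptions, not the project's training set):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# Toy scraped_text snippets and two of the twelve tags (is_surf, is_plongee).
texts = [
    "cours de surf et location de planches",
    "club de plongee sous-marine, baptemes",
    "ecole de surf, stages tous niveaux",
    "centre de plongee, sorties en mer",
]
labels = [[1, 0], [0, 1], [1, 0], [0, 1]]

# TF-IDF vectorisation feeding one logistic regression per tag.
clf = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(texts, labels)

# Each prediction is a binary vector of tags for one lead.
pred = clf.predict(["location de planches de surf"])
print(pred.shape)  # (1, 2)
```

In the real pipeline, a lead is kept only when the resulting tags mark it as nautical (`is_target == 1`); everything else is dropped before the costlier enrichment step.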

Data Flow

Google Places API
      │
      ▼
data/{region}_leads.csv          ← collect     (Places metadata + scraped_text)
      │
      ▼
data/{region}_predicted.csv      ← predict     (ML filter — is_target == 1 only)
      │
      ▼
data/{region}_enriched.csv       ← enrich      (email + phone)
      │
      ▼
data/_locked/final_for_sql.csv   ← consolidate (standardise + unique_key + upsert)
      │                  │
      ▼                  ▼
nautical_leads.db    public_nautical_data.csv
    ↑                       ↑
 sync-db             snapshot → Dashboard / GitHub
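The sync-db step can be sketched with pandas and sqlite3. This simplified version does a full table replace into an in-memory database rather than the project's upsert into `nautical_leads.db`, and the two sample rows are invented:

```python
import sqlite3
import pandas as pd

# Hypothetical mini master dataset; the real source is data/_locked/final_for_sql.csv.
master = pd.DataFrame({
    "unique_key": ["brest_voilerie-du-port", "quimper_ecole-de-surf"],
    "name": ["Voilerie du Port", "Ecole de Surf"],
    "city": ["Brest", "Quimper"],
    "email": ["contact@example.com", None],
})

con = sqlite3.connect(":memory:")  # the real DB is nautical_leads.db

# Replace the leads table from the master CSV (a simple full sync).
master.to_sql("leads", con, if_exists="replace", index=False)

rows = con.execute("SELECT COUNT(*) FROM leads").fetchone()[0]
print(rows)  # 2
```

The actual schema also includes `leads_content` and a `master_view` joining the two; the single-table replace above just shows the CSV-to-SQLite direction of the sync.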

Dashboard

streamlit run ui/app.py

Dashboard Screenshot


Technical Methodology

For in-depth details on the ML architecture, labelling strategy, vectorisation pipeline, and performance analysis:
