End-to-end automated system for discovering, classifying, and enriching nautical business leads across France — powered by the Google Places API, Scikit-Learn, and SQLite.
This pipeline transforms a manual, time-consuming prospecting process into an automated industrial workflow. Starting from a geographic grid query against the Google Places API, it filters semantic noise with a multi-label ML classifier, enriches confirmed leads with contact information, and delivers a clean, normalised SQLite database ready for outreach campaigns.
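The grid query mentioned above amounts to tiling a region's bounding box into points, each of which becomes the centre of one Places request. A minimal sketch — the Brittany-like bounds and the 0.2-degree step are illustrative, not the project's actual values, and the real pipeline additionally drops inland points with the shapely coastal buffer before any API call:

```python
# Illustrative grid generator: tile a lat/lng bounding box into points.
# Bounds and step size are assumptions, not the project's real config.
def make_grid(lat_min, lat_max, lng_min, lng_max, step=0.2):
    points = []
    lat = lat_min
    while lat <= lat_max:
        lng = lng_min
        while lng <= lng_max:
            # Each point would become one Places query centre;
            # the coastal buffer filter would prune inland points here.
            points.append((round(lat, 4), round(lng, 4)))
            lng += step
        lat += step
    return points

grid = make_grid(47.2, 48.9, -5.2, -1.0)
```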
Key results on the current dataset (1,235 leads):
- Classifier Hamming Loss: 0.046
- Exact-match accuracy (12 labels simultaneously): 64.3%
- Accuracy within 1-label tolerance: >90%
- API cost savings via coastal buffer filter: up to 90%
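Both headline metrics correspond to scikit-learn built-ins. A minimal sketch on a toy 12-column label matrix (illustrative only — the real evaluation uses the classifier's predictions on the held-out set):

```python
import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score

# Toy ground truth / predictions over 12 binary activity tags.
y_true = np.array([[1, 0, 1] + [0] * 9,
                   [0, 1, 0] + [0] * 9])
y_pred = np.array([[1, 0, 0] + [0] * 9,
                   [0, 1, 0] + [0] * 9])

# Hamming loss: fraction of individual label slots that are wrong.
print(hamming_loss(y_true, y_pred))    # 1 wrong slot out of 24 -> ~0.042

# Exact match: a lead counts only if all 12 labels agree simultaneously.
print(accuracy_score(y_true, y_pred))  # 1 of 2 rows fully correct -> 0.5
```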
- Geospatial Collection — Grid-based scraping with configurable region bounds and a shapely coastal buffer filter that eliminates inland grid points before any API call
- Website Text Harvesting — Lightweight HTTP crawler captures public page content during collection, ready for the ML step
- Multi-Label ML Filtering — TF-IDF + logistic regression assigns 12 nautical activity tags (`is_trad`, `is_surf`, `is_plongee`, …) and drops non-nautical businesses
- Three-Layer Deduplication — `place_id`, normalised website URL, and `name + city` composite key
- Contact Enrichment — Email and phone extraction runs only on ML-confirmed leads, saving time; a random wait between requests avoids rate limits
- SQL Synchronisation — One command syncs the master CSV to a normalised SQLite schema (`leads`, `leads_content`, `master_view`)
- Public Snapshot — Anonymised export (contacts replaced with `FOUND`/`NOT FOUND`) for safe GitHub hosting
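The multi-label filter in the list above can be sketched as a standard scikit-learn pipeline. The corpus, tag columns, and vectoriser settings below are illustrative assumptions — the project trains on the harvested `scraped_text` column across all 12 tags:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus of scraped website text.
texts = [
    "location de voiliers traditionnels et sorties en mer",
    "cours de surf et location de planches sur la plage",
    "restaurant gastronomique vue mer",
]
labels = [[1, 0], [0, 1], [0, 0]]  # columns: is_trad, is_surf

clf = make_pipeline(
    TfidfVectorizer(),                           # sparse TF-IDF features
    MultiOutputClassifier(LogisticRegression()),  # one binary model per tag
)
clf.fit(texts, labels)
print(clf.predict(["ecole de surf pour debutants"]))  # one row, one 0/1 per tag
```

`MultiOutputClassifier` is just one idiomatic way to get an independent binary logistic model per tag; `OneVsRestClassifier` would work equally well here.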
```bash
pip install -r requirements.txt
```

Copy `.env.example` to `.env` and set your credentials:
| Variable | Description |
|---|---|
| `GOOGLE_MAPS_API_KEY` | Google Places API key |
| `GMAIL_USER` | Gmail sender address |
| `GMAIL_APP_PASSWORD` | Gmail app password |
| `SENDER_NAME` | Display name for outreach emails |
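At start-up these variables need to be read from the environment. A minimal fail-fast sketch using only the stdlib — the `load_credentials` helper is hypothetical, not part of the project's API:

```python
import os

# Variable names match the table above; the helper itself is illustrative.
REQUIRED = ["GOOGLE_MAPS_API_KEY", "GMAIL_USER",
            "GMAIL_APP_PASSWORD", "SENDER_NAME"]

def load_credentials():
    """Return all required credentials, raising early if any is unset."""
    missing = [name for name in REQUIRED if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing env vars: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED}
```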
```bash
# 1. Collect raw leads from Google Places (includes website text scraping)
python cli.py collect --region bretagne

# 2. ML filter: assign 12 activity tags, keep is_target == 1 leads only
python cli.py predict --region bretagne

# 3. Enrich confirmed leads with email + phone
python cli.py enrich --region bretagne

# 4. Standardise schema, generate unique keys, upsert into master dataset
python cli.py consolidate --region bretagne

# 5. Sync master CSV to SQLite database
python cli.py sync-db

# 6. Export anonymised public snapshot
python cli.py snapshot

# --- Utilities ---
python cli.py status          # Data-quality dashboard
python cli.py train           # Retrain the ML classifier
python cli.py mail --dry-run  # Preview outreach campaign
python cli.py mail            # Send outreach campaign
```

Google Places API
```
        │
        ▼
data/{region}_leads.csv          ← collect (Places metadata + scraped_text)
        │
        ▼
data/{region}_predicted.csv      ← predict (ML filter — is_target == 1 only)
        │
        ▼
data/{region}_enriched.csv       ← enrich (email + phone)
        │
        ▼
data/_locked/final_for_sql.csv   ← consolidate (standardise + unique_key + upsert)
        │                  │
        ▼                  ▼
nautical_leads.db    public_nautical_data.csv
        ↑                  ↑
     sync-db            snapshot → Dashboard / GitHub
```
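The `sync-db` step boils down to an idempotent upsert from the master CSV into SQLite. A minimal stdlib sketch — the column list here is a simplified assumption based on the schema names above, not the project's full `leads` table:

```python
import csv
import sqlite3

def sync_csv_to_sqlite(csv_path: str, db_path: str) -> None:
    """Upsert rows from the master CSV into the leads table, keyed on
    unique_key (INSERT OR REPLACE keeps repeated syncs idempotent)."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS leads (
                       unique_key TEXT PRIMARY KEY,
                       name TEXT, city TEXT, website TEXT)""")
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            con.execute(
                "INSERT OR REPLACE INTO leads VALUES (?, ?, ?, ?)",
                (row["unique_key"], row["name"], row["city"], row["website"]),
            )
    con.commit()
    con.close()
```

Because the primary key is the consolidation step's `unique_key`, rerunning the sync after a new `consolidate` updates existing leads in place instead of duplicating them.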
To launch the Streamlit dashboard:

```bash
streamlit run ui/app.py
```

For in-depth details on the ML architecture, labelling strategy, vectorisation pipeline, and performance analysis:
