README.md: 24 additions & 89 deletions (+24 -89)
@@ -12,77 +12,44 @@ An interactive map showing where Swiss municipalities host their email — wheth
The data pipeline has two stages:
- 1. **Resolve domains**: Fetches all ~2100 Swiss municipalities from Wikidata and the BFS (Swiss Statistics) API, scrapes municipal websites for email addresses, performs DNS lookups (MX, SPF, DKIM, Autodiscover), resolves SPF includes/redirects, follows CNAME chains, runs ASN lookups, and detects mail gateways. Outputs `municipality_domains.json`.
+ 1. **Resolve domains**: Fetches all ~2100 Swiss municipalities from Wikidata and the BFS (Swiss Statistics) API, applies manual overrides, scrapes municipal websites for email addresses, guesses domains from municipality names, and verifies candidates with MX lookups. Scores source agreement to pick the best domain. Outputs `municipality_domains.json`.
- 2. **Classify providers**: For each resolved domain, runs 11 concurrent DNS fingerprinting probes. Aggregates weighted evidence, computes confidence scores (0–100), and applies manual overrides from `overrides.json`. Outputs `data.json` (full) and `data.min.json` (minified for the frontend).
+ 2. **Classify providers**: For each resolved domain, looks up all MX hosts, pattern-matches them, then runs 10 concurrent probes (SPF, DKIM, DMARC, Autodiscover, CNAME chain, SMTP banner, Tenant, ASN, TXT verification, SPF IP). Aggregates weighted evidence and computes confidence scores (0–100). Outputs `data.json` (full) and `data.min.json` (minified for the frontend).
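The "concurrent probes" step in stage 2 could be sketched roughly as follows. This is a minimal illustration that assumes an `asyncio`-based design; the probe functions and `run_probes` are hypothetical names, not taken from the repository, and the probe bodies are placeholders rather than real DNS or SMTP lookups.

```python
import asyncio


async def probe_spf(domain):
    # Placeholder: a real probe would fetch and parse the domain's SPF record.
    await asyncio.sleep(0)
    return {"probe": "spf", "domain": domain, "provider": None}


async def probe_smtp_banner(domain):
    # Placeholder that simulates a probe failing, e.g. on a network timeout.
    await asyncio.sleep(0)
    raise TimeoutError(f"no SMTP banner from {domain}")


async def run_probes(domain, probes):
    """Run all probes for one domain concurrently and keep the successful results."""
    results = await asyncio.gather(
        *(probe(domain) for probe in probes),
        return_exceptions=True,  # a failing probe must not cancel the others
    )
    return [r for r in results if not isinstance(r, Exception)]


if __name__ == "__main__":
    print(asyncio.run(run_probes("example.ch", [probe_spf, probe_smtp_banner])))
```

Collecting exceptions instead of raising them means a single unreachable mail server only removes one piece of evidence instead of aborting the classification of the whole domain.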
Confidence is computed as `vote_share * depth_factor` on a 0–100 scale. Some signals (Tenant, TXT verification, ASN, SPF IP) are confirmation-only — they can reinforce but not establish a classification.
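To make the formula concrete, here is a hedged sketch of how `vote_share * depth_factor` and the confirmation-only rule might fit together. The README does not define `depth_factor`, so the definition below (the number of distinct signal kinds backing the winner, saturating at `full_depth`) is purely an assumption, as are the function name, the signal-kind identifiers, and the weights.

```python
from collections import defaultdict

# Signal kinds the README lists as confirmation-only (identifiers are assumed).
CONFIRMATION_ONLY = {"tenant", "txt_verification", "asn", "spf_ip"}


def confidence(signals, full_depth=3):
    """signals: list of (provider, kind, weight) tuples.

    Returns (provider, score on a 0-100 scale), or (None, 0.0) when no
    primary signal establishes a classification.
    """
    # A provider is eligible only if at least one primary signal points to it.
    eligible = {p for p, kind, _ in signals if kind not in CONFIRMATION_ONLY}
    if not eligible:
        return None, 0.0

    totals = defaultdict(float)
    kinds = defaultdict(set)
    for provider, kind, weight in signals:
        if provider in eligible:  # confirmation-only signals can only reinforce
            totals[provider] += weight
            kinds[provider].add(kind)

    winner = max(totals, key=totals.get)
    vote_share = totals[winner] / sum(totals.values())
    # Assumed depth factor: reaches 1.0 once `full_depth` distinct kinds agree.
    depth_factor = min(1.0, len(kinds[winner]) / full_depth)
    return winner, round(100 * vote_share * depth_factor, 1)
```

Under these assumptions, an ASN match alone yields no classification, while the same ASN match alongside an MX and an SPF match for the same provider raises that provider's score.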
-
- Mail gateways (SeppMail, Barracuda, Proofpoint, Sophos, etc.) are detected and reported separately.
+ See [`classifier.py`](src/mail_sovereignty/classifier.py) for the full implementation details. In summary,
+ we use a weighted evidence system in which each probe contributes signals of varying strength toward different provider classifications.
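As a rough illustration of "signals of varying strength", the snippet below tallies probe observations with a per-probe weight table. The weights, the fallback value, and the `tally` helper are invented for this example; the actual values and structure are in `classifier.py`.

```python
from collections import Counter

# Hypothetical weights: stronger probes count for more. Not the project's real values.
PROBE_WEIGHTS = {
    "mx": 3.0,
    "spf": 2.0,
    "dkim": 2.0,
    "autodiscover": 1.5,
    "smtp_banner": 1.0,
}


def tally(observations):
    """observations: iterable of (probe, provider) pairs.

    Returns (provider, total_weight) pairs, strongest first.
    """
    votes = Counter()
    for probe, provider in observations:
        # Probes missing from the table get a small default weight (assumption).
        votes[provider] += PROBE_WEIGHTS.get(probe, 0.5)
    return votes.most_common()
```

For instance, an MX and an SPF match for one provider together outweigh a lone DKIM match for another, so the first provider leads the ranking.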