Commit 4d4cca9

committed
readme
1 parent 0649c10 commit 4d4cca9

1 file changed

Lines changed: 24 additions & 89 deletions

File tree

README.md

@@ -12,77 +12,44 @@ An interactive map showing where Swiss municipalities host their email — wheth
1212

1313
The data pipeline has two stages:
1414

15-
1. **Resolve domains** — Fetches all ~2100 Swiss municipalities from Wikidata and the BFS (Swiss Statistics) API, scrapes municipal websites for email addresses, performs DNS lookups (MX, SPF, DKIM, Autodiscover), resolves SPF includes/redirects, follows CNAME chains, runs ASN lookups, and detects mail gateways. Outputs `municipality_domains.json`.
15+
1. **Resolve domains** — Fetches all ~2100 Swiss municipalities from Wikidata and the BFS (Swiss Statistics) API, applies manual overrides, scrapes municipal websites for email addresses, guesses domains from municipality names, and verifies candidates with MX lookups. Scores source agreement to pick the best domain. Outputs `municipality_domains.json`.
1616

17-
2. **Classify providers** — For each resolved domain, runs 11 concurrent DNS fingerprinting probes. Aggregates weighted evidence, computes confidence scores (0–100), and applies manual overrides from `overrides.json`. Outputs `data.json` (full) and `data.min.json` (minified for the frontend).
17+
2. **Classify providers** — For each resolved domain, looks up all MX hosts, pattern-matches them, then runs 10 concurrent probes (SPF, DKIM, DMARC, Autodiscover, CNAME chain, SMTP banner, Tenant, ASN, TXT verification, SPF IP). Aggregates weighted evidence, computes confidence scores (0–100). Outputs `data.json` (full) and `data.min.json` (minified for the frontend).
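
As a rough illustration of the "guess domains, verify with MX, score source agreement" steps in stage 1, here is a minimal sketch. The function names, the `SOURCE_WEIGHTS` table, the slug rules, and the injected `has_mx` callable are all assumptions made for this example and are not taken from the project's code; a real run would back `has_mx` with an actual DNS MX lookup.

```python
from __future__ import annotations

import re
import unicodedata
from collections import defaultdict
from typing import Callable, Iterable, Optional

# Hypothetical source weights; the project's real scoring rules may differ.
SOURCE_WEIGHTS = {"override": 3.0, "scrape": 2.0, "guess": 1.0}


def guess_domains(name: str) -> list[str]:
    """Derive candidate .ch domains from a municipality name (illustrative rules)."""
    slug = name.lower()
    # Transliterate umlauts the way German-language domains commonly do.
    for src, dst in (("ä", "ae"), ("ö", "oe"), ("ü", "ue")):
        slug = slug.replace(src, dst)
    # Drop any remaining accents, then collapse non-alphanumerics to hyphens.
    slug = unicodedata.normalize("NFKD", slug).encode("ascii", "ignore").decode()
    slug = re.sub(r"[^a-z0-9]+", "-", slug).strip("-")
    return [f"{slug}.ch", f"gemeinde-{slug}.ch"]


def pick_domain(
    candidates: Iterable[tuple[str, str]],
    has_mx: Callable[[str], bool],
) -> Optional[str]:
    """Pick the MX-verified domain that the most (weighted) sources agree on.

    `candidates` is an iterable of (source, domain) pairs.
    """
    scores: dict[str, float] = defaultdict(float)
    for source, domain in candidates:
        scores[domain] += SOURCE_WEIGHTS.get(source, 0.0)
    # Only domains that actually publish MX records are eligible.
    verified = {d: s for d, s in scores.items() if has_mx(d)}
    if not verified:
        return None
    return max(verified, key=verified.get)
```

Passing `has_mx` in as a parameter keeps the scoring logic testable without network access; in this sketch a domain proposed by two sources simply accumulates both weights.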
1818

1919
```mermaid
2020
flowchart TD
2121
subgraph resolve ["1 · Resolve domains"]
22-
wikidata[/"Wikidata SPARQL"/] --> fetch["Fetch ~2100 municipalities"]
23-
bfs[/"BFS Statistics API"/] --> fetch
24-
fetch --> scrape["Scrape websites for<br/>email addresses"]
25-
fetch --> guess["Generate domain<br/>candidates"]
26-
scrape --> dns["MX + SPF + DKIM +<br/>Autodiscover lookups<br/>(3 resolvers)"]
27-
guess --> dns
28-
dns --> spf_resolve["Resolve SPF includes<br/>& redirects"]
29-
spf_resolve --> cname["Follow CNAME chains"]
30-
cname --> asn["ASN lookups<br/>(Team Cymru)"]
31-
asn --> gateway["Detect gateways<br/>(SeppMail, Barracuda,<br/>Proofpoint, Sophos …)"]
22+
bfs[/"BFS Statistics API"/] --> merge["Merge ~2100 municipalities"]
23+
wikidata[/"Wikidata SPARQL"/] --> merge
24+
overrides[/"overrides.json"/] --> per_muni
25+
merge --> per_muni["Per municipality"]
26+
per_muni --> scrape["Scrape website for<br/>email addresses"]
27+
per_muni --> guess["Guess domains<br/>from name"]
28+
scrape --> mx_verify["MX lookup to<br/>verify domains"]
29+
guess --> mx_verify
30+
mx_verify --> score["Score source<br/>agreement"]
3231
end
3332
34-
gateway --> domains[("municipality_domains.json")]
35-
domains --> probes
33+
score --> domains[("municipality_domains.json")]
34+
domains --> classify_in
3635
3736
subgraph classify ["2 · Classify providers"]
38-
probes["11 concurrent probes<br/>MX · SPF · DKIM · DMARC<br/>Autodiscover · CNAME · SMTP<br/>Tenant · ASN · TXT · SPF IP"]
39-
probes --> aggregate["Aggregate weighted<br/>evidence per provider"]
40-
aggregate --> score["Confidence scoring<br/>0–100"]
41-
score --> overrides["Apply manual overrides"]
37+
classify_in["Per unique domain"] --> mx_lookup["MX lookup<br/>(all hosts)"]
38+
mx_lookup --> mx_match["Pattern-match MX<br/>+ detect gateway"]
39+
mx_match --> concurrent["10 concurrent probes<br/>SPF · DKIM · DMARC<br/>Autodiscover · CNAME chain<br/>SMTP · Tenant · ASN<br/>TXT verification · SPF IP"]
40+
concurrent --> aggregate["Aggregate weighted<br/>evidence"]
41+
aggregate --> vote["Primary vote<br/>+ confidence scoring"]
4242
end
4343
44-
overrides --> data[("data.json + data.min.json")]
45-
data --> frontend["Interactive Leaflet map<br/>mxmap.ch"]
46-
47-
style wikidata fill:#e8f4fd,stroke:#4a90d9,color:#1a5276
48-
style bfs fill:#e8f4fd,stroke:#4a90d9,color:#1a5276
49-
style domains fill:#d5f5e3,stroke:#27ae60,color:#1e8449
50-
style data fill:#d5f5e3,stroke:#27ae60,color:#1e8449
51-
style frontend fill:#d5f5e3,stroke:#27ae60,color:#1e8449
44+
vote --> data[("data.json + data.min.json")]
45+
data --> frontend["Leaflet map<br/>mxmap.ch"]
5246
```
5347

5448
## Classification system
5549

56-
Each domain is classified into one of six provider categories:
57-
58-
| Provider | Examples |
59-
|---|---|
60-
| **Microsoft 365** | `mail.protection.outlook.com`, `onmicrosoft.com` |
61-
| **Google Workspace** | `aspmx.l.google.com`, `_spf.google.com` |
62-
| **AWS** | `amazonaws.com`, `amazonses.com` |
63-
| **Infomaniak** | `mxpool.infomaniak.com`, `spf.infomaniak.ch` |
64-
| **Swiss ISP** | Swisscom, Sunrise, Init7, SWITCH, Hostpoint, and others (13 known ASNs) |
65-
| **Independent** | Self-hosted or other providers |
66-
67-
Classification uses 11 weighted signal types collected as evidence:
68-
69-
| Signal | Weight | Source |
70-
|---|---|---|
71-
| MX | 0.20 | MX record hostnames |
72-
| SPF | 0.20 | SPF `include:` directives |
73-
| DKIM | 0.15 | DKIM selector CNAME targets |
74-
| Tenant | 0.10 | Microsoft 365 tenant detection |
75-
| SPF IP | 0.08 | IP addresses in SPF records |
76-
| Autodiscover | 0.08 | Autodiscover CNAME/SRV records |
77-
| TXT verification | 0.07 | TXT verification records (e.g. `ms=ms`) |
78-
| SMTP | 0.04 | SMTP banner on port 25 |
79-
| CNAME chain | 0.03 | MX host CNAME resolution |
80-
| ASN | 0.03 | ASN ownership (Team Cymru) |
81-
| DMARC | 0.02 | DMARC TXT record patterns |
82-
83-
Confidence is computed as `vote_share * depth_factor` on a 0–100 scale. Some signals (Tenant, TXT verification, ASN, SPF IP) are confirmation-only — they can reinforce but not establish a classification.
84-
85-
Mail gateways (SeppMail, Barracuda, Proofpoint, Sophos, etc.) are detected and reported separately.
50+
See [`classifier.py`](src/mail_sovereignty/classifier.py) for the full implementation details. In summary,
51+
we use a weighted evidence system in which each probe contributes signals of varying strength toward different provider classifications.
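
To make the idea of weighted evidence concrete, here is a simplified, standalone sketch. The `Evidence` dataclass, the `classify` function, and the example weights are hypothetical stand-ins for illustration only; the actual logic in `classifier.py` may weigh, combine, and score signals differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One signal from one probe (illustrative stand-in, not the project's model)."""

    provider: str  # e.g. "microsoft", "google"
    signal: str    # e.g. "mx", "spf", "dkim"
    weight: float  # assumed strength of this signal


def classify(evidence: list[Evidence]) -> tuple[str, int]:
    """Return (provider, confidence on a 0-100 scale) from weighted evidence."""
    if not evidence:
        return ("unknown", 0)
    # Sum the weights of all signals pointing at each provider.
    totals: dict[str, float] = defaultdict(float)
    for item in evidence:
        totals[item.provider] += item.weight
    # The provider with the largest weighted total wins the vote.
    winner = max(totals, key=totals.get)
    # Confidence here is just the winner's share of all collected weight.
    vote_share = totals[winner] / sum(totals.values())
    return (winner, round(100 * vote_share))
```

For example, two signals of weight 0.20 for one provider against a single 0.15 signal for another give the first provider the vote, with a confidence equal to its share of the total weight.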
52+
8653

8754
## Quick start
8855

@@ -112,44 +79,12 @@ uv run ruff check src tests
11279
uv run ruff format src tests
11380
```
11481

115-
## Project structure
116-
117-
```
118-
src/mail_sovereignty/
119-
├── cli.py # CLI entry points (resolve-domains, classify-providers)
120-
├── resolve.py # Stage 1: domain resolution pipeline
121-
├── pipeline.py # Stage 2: classification orchestration
122-
├── classifier.py # Evidence aggregation and confidence scoring
123-
├── probes.py # 11 async DNS probe functions
124-
├── signatures.py # Provider fingerprint patterns and ASN mappings
125-
├── models.py # Pydantic models (Provider, SignalKind, Evidence, …)
126-
├── dns.py # DNS resolver setup and MX lookups
127-
├── bfs_api.py # BFS (Swiss Statistics) API client
128-
├── constants.py # Wikidata SPARQL query, skip domains, canton mappings
129-
└── log.py # Logging setup (loguru)
130-
131-
tests/ # 13 test modules, pytest + pytest-asyncio
132-
133-
index.html # Frontend SPA (Leaflet map + TopoJSON boundaries)
134-
impressum.html # Legal impressum
135-
datenschutz.html # Privacy policy
136-
```
137-
138-
## Data files
139-
140-
| File | Description |
141-
|-----------------------------|---------------------------------------------------------------------|
142-
| `municipality_domains.json` | Intermediate output from Stage 1 — domains, MX/SPF records, sources |
143-
| `data.json` | Final classifications with full evidence and confidence scores |
144-
| `data.min.json` | Minified version served to the frontend |
145-
| `overrides.json` | Manual classification corrections for edge cases |
146-
| `mxmap.log` | Pipeline log files (might be outdated) |
147-
14882

14983
## Related work
15084

15185
* [hpr4379 :: Mapping Municipalities' Digital Dependencies](https://hackerpublicradio.org/eps/hpr4379/index.html)
15286
* If you know of other similar projects, please open an issue or submit a PR to add them here!
87+
* See also the forks of this repository for related efforts.
15388

15489
## Contributing
15590
