protomaps · migurski · Mar 12, 2026
diff --git a/tiles/WIKIDATA.md b/tiles/WIKIDATA.md
@@ -0,0 +1,148 @@
+# Wikidata ID Availability in Overture Data
+
+Investigation into whether Overture POIs can be linked to Wikidata for richer rendering.
+
+## Data sources examined
+
+`Oakland-visualtests.parquet` — Overture 2026-02-18 release, Bay Area coverage.
+
+## Findings by theme
+
+### `places` theme (618,880 features)
+
+- **Top-level `wikidata` field**: entirely null — 0 out of 618,880 features have a value.
+- **`brand.wikidata`**: present for ~11,059 features — chain businesses where the *brand entity* has a Wikidata ID (e.g. `Q177054` for Burger King). Does not help with unique/famous places.
+- **Unique/famous places** (Oakland Zoo, Oakland Museum of California, Oakland International Airport, UC Berkeley, etc.): no wikidata at all, neither top-level nor brand.
+
+### `divisions` theme (3,615 features)
+
+- **Top-level `wikidata` field**: populated for cities, counties, etc. (e.g. `Q62` for San Francisco, `Q927122` for South San Francisco).
+- These feed the `places` map layer via `Places.java`, which already exports the `wikidata` attribute to output tiles (lines 324–325 and 431).
+
+## Alternative: website domain matching
+
+Overture places features often include `websites` and `socials` arrays.
+
+- **`socials`**: contains Facebook URLs with numeric page IDs (e.g. `facebook.com/353030440227`). Wikidata stores Facebook *usernames* (P2013), not numeric IDs — no direct join possible.
+- **`websites`**: contains place-specific URLs. Extracting the root domain and matching against Wikidata P856 (official website) works:
+  - `oaklandzoo.org` → Q2008530 (Oakland Zoo) ✅
+  - `museumca.org` → Q877714 (Oakland Museum of California) ✅
+  - `oaklandairport.com` / `flyoakland.com` → Oakland International Airport ✅
+
+### Scale
+
+- ~303,848 unique meaningful domains across all places features (excluding generic social/link domains).
+- Too large to bulk-query the Wikidata SPARQL endpoint (60-second hard timeout).
+
+### Caveats
+
+- Multiple Overture features can share the same domain (e.g. all Oakland Public Library branches → `oaklandlibrary.org`), so the match links to the organization entity, not a specific location.
+- Some domains map to multiple QIDs (e.g. `museumca.org` → Q877714, Q133252684, Q30672317) — needs disambiguation.
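+- No coverage for places without a usable website URL: about 16% of features in this extract (96,081 of 618,880), going by the counts in the match-rate section below.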
+
+## Bulk Wikidata P856 export via QLever
+
+The Wikidata SPARQL endpoint times out on full P856 scans. [QLever](https://qlever.dev/wikidata) (University of Freiburg) is a faster alternative engine that handles full-dataset scans. The complete P856 table (2.3M rows) was fetched in one query in ~16 seconds:
+
+```sparql
+PREFIX wdt: <http://www.wikidata.org/prop/direct/>
+SELECT ?item ?website WHERE { ?item wdt:P856 ?website }
+```
+
+```sh
+curl -H "Accept: text/tab-separated-values" \
+  --data-urlencode "query=PREFIX wdt: <http://www.wikidata.org/prop/direct/> SELECT ?item ?website WHERE { ?item wdt:P856 ?website }" \
+  --data-urlencode "send=2400000" \
+  "https://qlever.dev/api/wikidata" \
+  -o data/sources/wikidata-p856.tsv
+```
+
+The TSV was then parsed to extract QID and root domain, and saved as `data/sources/wikidata-p856.parquet` (36 MB, 2.3M rows).
+
+## Match rate against Overture Oakland data
+
+Joining `wikidata-p856.parquet` against `Oakland-visualtests.parquet` on root domain (excluding generic domains like facebook.com, yelp.com, etc.):
+
+- **522,799** places have a usable website URL
+- **121,685** (23.3%) matched to at least one Wikidata QID
+
+Confirmed matches for notable places:
+- `oaklandzoo.org` → Q2008530 (Oakland Zoo) ✅
+- `museumca.org` → Q877714 (Oakland Museum of California) ✅
+- `iflyoak.com` → Q1165584 (Oakland International Airport) ✅
+- `berkeley.edu` → Q168756 (UC Berkeley) ✅
+
+Top matched categories: doctor, park, government association, medical center, hotel, university, library, landmark.
+
+## Disambiguation: multiple QIDs per domain
+
+142,182 domains map to more than one QID in the P856 table (avg 6.7, max 81,822 for a digital library collection domain where every digitized item has its own Wikidata entry).
+
+### Approaches considered
+
+**Full URL path matching** — prefer the QID whose stored P856 URL most closely matches the full Overture URL, not just the domain. Ruled out: Wikidata typically stores bare root URLs (`http://www.oaklandzoo.org`), so this rarely breaks ties.
+
+**P31 instance-of type matching** — fetch P31 (instance of) for all candidate QIDs and prefer the one whose type aligns with the Overture category (e.g. Overture `museum` → prefer QID with `instance of: Q33506`). Ruled out: P31 has 122M rows in Wikidata; QLever serves at most ~15M rows per query and the full download fails. Batching 920k candidate QIDs via the SPARQL endpoint would be slow and fragile.
+
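+**Lowest Q-number tiebreak**: prefer the QID with the smallest numeric value. Wikidata assigns Q-numbers sequentially as items are created, so a lower number means an older item, and well-known organizations tend to have been entered early. Items for exhibitions, digitized sub-items, and branch locations are usually created after the item for their parent organization and therefore receive higher Q-numbers. This is a heuristic rather than a guarantee.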
+
+### Result
+
+The lowest Q-number heuristic gets the right answer in all tested cases:
+
+|        Domain        | Winner QID |                                  Entity                                  |
+|----------------------|------------|--------------------------------------------------------------------------|
+| `museumca.org`       | Q877714    | Oakland Museum of California (not the exhibitions Q133252684, Q30672317) |
+| `oaklandzoo.org`     | Q2008530   | Oakland Zoo                                                              |
+| `iflyoak.com`        | Q1165584   | Oakland International Airport                                            |
+| `berkeley.edu`       | Q168756    | UC Berkeley                                                              |
+| `oaklandlibrary.org` | Q1090829   | Oakland Public Library (not individual branches)                         |
+| `bart.gov`           | Q250113    | Bay Area Rapid Transit                                                   |
+
+### Output
+
+`data/sources/wikidata-domain-qid.parquet` — 1,432,271 domain → QID mappings, 30 MB. Built by grouping `wikidata-p856.parquet` by domain and taking the minimum Q-number per domain.
+
+## Integration with the tile build pipeline
+
+### Distribution format
+
+The final file should be published as a dated gzipped two-column CSV, e.g.:
+
+```
+wikidata-website-qid-2026-03.csv.gz
+domain,qid
+oaklandzoo.org,Q2008530
+museumca.org,Q877714
+...
+```
+
+This mirrors the format and hosting pattern of `qrank.csv.gz` (from `qrank.toolforge.org`), but hosted on `r2-public.protomaps.com` since this is a derived file we generate ourselves. A dated URL (like `Overture-QRank-2025-12-17.parquet`) makes renders reproducible. It would be regenerated periodically by re-running the QLever P856 query and rebuilding.
+
+The file is generated by `generate-wikidata-website-qid.sh` in the `tiles/` directory. The script fetches the full P856 table from QLever, extracts root domains, disambiguates via lowest Q-number, and writes the gzipped CSV to `data/sources/`, producing ~1.4M rows at ~15 MB in under 30 seconds.
+
+### Runtime lookup (two-hop via QRank)
+
+At render time the file is downloaded once into the sources directory if not present, then loaded into a `WebsiteQidDb` (a new class modeled on `QrankDb`) as a `HashMap<String, Long>` — domain string → numeric Q-ID.
+
+In `Pois.processOverture`, the website lookup acts as a fallback that fills in a wikidata ID when the feature doesn't have one natively, and then the existing QRank machinery takes over unchanged:
+
+```java
+String wikidata = sf.getString("wikidata");  // always null for Overture places theme
+if (wikidata == null) {
+    String website = /* first entry from sf.getList("websites") */;
+    wikidata = websiteQidDb.getQid(website);  // domain → "Q2008530"
+}
+long qrank = (wikidata != null) ? qrankDb.get(wikidata) : 0;
+var qrankedZoom = QrankDb.assignZoom(qrankGrading, kind, qrank);
+```
+
+The full lookup chain is:
+
+```
+sf.websites[0] → domain → Q-ID (WebsiteQidDb)
+                           Q-ID → qrank score (QrankDb)
+                                   qrank score → minZoom (assignZoom)
+```
+
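+No changes are needed to `QrankDb` or `assignZoom`: `getQid` returns the ID in the same `Q`-prefixed string form as a native `wikidata` attribute, so the existing `qrankDb.get(wikidata)` call in the snippet above handles it as-is. A place only benefits if it has a matching website entry *and* a QRank score; otherwise `qrank = 0` and behavior is identical to today.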
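The two-hop lookup chain at the end of WIKIDATA.md can be sketched in isolation. This is a minimal illustration, not the pipeline code: the two maps are hypothetical stand-ins for `WebsiteQidDb` and `QrankDb`, the class and method names are invented for the example, and the QRank score is a made-up placeholder. It only shows that a miss at either hop ends with a score of 0.

```java
import java.util.Map;

public class TwoHopLookupSketch {
  // Stand-in for WebsiteQidDb: domain -> numeric Q-ID (IDs taken from the results table).
  static final Map<String, Long> DOMAIN_TO_QID =
    Map.of("oaklandzoo.org", 2008530L, "museumca.org", 877714L);

  // Stand-in for QrankDb: numeric Q-ID -> QRank score. The score is a placeholder, not real data.
  static final Map<Long, Long> QID_TO_QRANK = Map.of(2008530L, 1000L);

  /** Returns the QRank score for a domain, or 0 when either hop misses. */
  static long qrankForDomain(String domain) {
    Long qid = DOMAIN_TO_QID.get(domain);
    if (qid == null) {
      return 0; // first hop missed: no Wikidata item is known for this domain
    }
    return QID_TO_QRANK.getOrDefault(qid, 0L); // second hop: 0 if the item has no score
  }

  public static void main(String[] args) {
    System.out.println(qrankForDomain("oaklandzoo.org"));  // both hops hit
    System.out.println(qrankForDomain("museumca.org"));    // second hop misses
    System.out.println(qrankForDomain("example.invalid")); // first hop misses
  }
}
```

Because both kinds of miss collapse to 0, features without a match keep the zoom behavior they have today.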
diff --git a/tiles/generate-wikidata-website-qid.sh b/tiles/generate-wikidata-website-qid.sh
@@ -0,0 +1,44 @@
+#!/bin/bash -ex
+set -o pipefail  # without this, a duckdb failure is masked by gzip's exit status
+
+# Generate wikidata-website-qid.csv.gz -- a mapping from website domain to Wikidata QID.
+#
+# Fetches the complete Wikidata P856 (official website) table from QLever,
+# extracts root domains, disambiguates multiple QIDs per domain by taking the
+# lowest Q-number, and writes a gzipped two-column CSV.
+#
+# Output: data/sources/wikidata-website-qid-YYYY-MM.csv.gz
+# Usage: ./generate-wikidata-website-qid.sh
+
+DATE=$(date +%Y-%m)
+OUTPUT="data/sources/wikidata-website-qid-${DATE}.csv.gz"
+TSV_TMP=$(mktemp /tmp/wikidata-p856-XXXXXX) && mv "$TSV_TMP" "${TSV_TMP}.tsv" && TSV_TMP="${TSV_TMP}.tsv"
+
+echo "Fetching Wikidata P856 (official website) from QLever..."
+curl --fail \
+  -H "Accept: text/tab-separated-values" \
+  --data-urlencode "query=PREFIX wdt: <http://www.wikidata.org/prop/direct/> SELECT ?item ?website WHERE { ?item wdt:P856 ?website }" \
+  --data-urlencode "send=2400000" \
+  "https://qlever.dev/api/wikidata" \
+  -o "$TSV_TMP"
+
+echo "Building domain -> QID mapping..."
+duckdb -c "
+COPY (
+  SELECT
+    regexp_extract(lower(\"?website\"), 'https?://(?:www\\.)?([^/>\?]+)', 1) AS domain,
+    arg_min(
+      regexp_extract(\"?item\", 'entity/(Q[0-9]+)', 1),
+      CAST(regexp_extract(\"?item\", 'Q([0-9]+)', 1) AS INTEGER)
+    ) AS qid
+  FROM read_csv('${TSV_TMP}', delim='\t', header=true, ignore_errors=true)
+  WHERE regexp_extract(\"?item\", 'entity/(Q[0-9]+)', 1) != ''
+    AND regexp_extract(lower(\"?website\"), 'https?://(?:www\\.)?([^/>\?]+)', 1) != ''
+  GROUP BY domain
+  ORDER BY domain
+) TO '/dev/stdout' (FORMAT CSV, HEADER true)
+" | gzip > "$OUTPUT"
+
+rm "$TSV_TMP"
+
+echo "Done: ${OUTPUT} ($(du -sh "$OUTPUT" | cut -f1))"
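For readers more at home in Java than SQL, the DuckDB query in the script above can be mirrored as follows. This is a sketch under assumptions: the class and method names are hypothetical, the input rows are assumed to hold IRIs wrapped in angle brackets (which the `>` in the script's regex suggests), and the exhibit URL in the sample data is invented for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainQidSketch {
  // Same patterns as the DuckDB query: host without a leading "www.", and the numeric part of the QID.
  private static final Pattern DOMAIN = Pattern.compile("https?://(?:www\\.)?([^/>?]+)");
  private static final Pattern QID = Pattern.compile("entity/Q([0-9]+)");

  /** Lowercased host of a URL without a leading "www.", or null if the pattern does not match. */
  static String domainOf(String website) {
    Matcher m = DOMAIN.matcher(website.toLowerCase(Locale.ROOT));
    return m.find() ? m.group(1) : null;
  }

  /** Reduces (item IRI, website) rows to domain -> lowest numeric Q-ID, ordered by domain. */
  static Map<String, Long> build(List<String[]> rows) {
    Map<String, Long> lowest = new TreeMap<>();
    for (String[] row : rows) {
      Matcher q = QID.matcher(row[0]);
      String domain = domainOf(row[1]);
      if (!q.find() || domain == null) {
        continue; // mirrors the WHERE clause: skip rows without a QID or a domain
      }
      // Equivalent of arg_min: keep the smaller Q-number when a domain repeats.
      lowest.merge(domain, Long.parseLong(q.group(1)), Long::min);
    }
    return lowest;
  }

  public static void main(String[] args) {
    List<String[]> rows = List.of(
      new String[] {"<http://www.wikidata.org/entity/Q133252684>", "<http://museumca.org/exhibit>"},
      new String[] {"<http://www.wikidata.org/entity/Q877714>", "<http://www.museumca.org/>"},
      new String[] {"<http://www.wikidata.org/entity/Q2008530>", "<https://www.OaklandZoo.org>"});
    // Prints museumca.org,Q877714 then oaklandzoo.org,Q2008530
    build(rows).forEach((domain, qid) -> System.out.println(domain + ",Q" + qid));
  }
}
```

Note that the pattern strips only a leading `www.`; any other subdomain stays part of the key, so `lib.example.org` and `example.org` would be separate entries.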
diff --git a/tiles/src/main/java/com/protomaps/basemap/Basemap.java b/tiles/src/main/java/com/protomaps/basemap/Basemap.java
@@ -8,6 +8,7 @@
 import com.onthegomap.planetiler.util.Downloader;
 import com.protomaps.basemap.feature.CountryCoder;
 import com.protomaps.basemap.feature.QrankDb;
+import com.protomaps.basemap.feature.WebsiteQidDb;
 import com.protomaps.basemap.layers.Boundaries;
 import com.protomaps.basemap.layers.Buildings;
 import com.protomaps.basemap.layers.Earth;
@@ -38,7 +39,7 @@ public class Basemap extends ForwardingProfile {
 
   private static final Logger LOGGER = LoggerFactory.getLogger(Basemap.class);
 
-  public Basemap(QrankDb qrankDb, CountryCoder countryCoder, Clip clip,
+  public Basemap(QrankDb qrankDb, WebsiteQidDb websiteQidDb, CountryCoder countryCoder, Clip clip,
     String layer) {
 
     if (layer.isEmpty() || layer.equals(Boundaries.LAYER_NAME)) {
@@ -78,7 +79,7 @@ public Basemap(QrankDb qrankDb, CountryCoder countryCoder, Clip clip,
     }
 
     if (layer.isEmpty() || layer.equals(Pois.LAYER_NAME)) {
-      var poi = new Pois(qrankDb);
+      var poi = new Pois(qrankDb, websiteQidDb);
       registerHandler(poi);
       registerSourceHandler("osm", poi::processOsm);
       registerSourceHandler("pm:overture", poi::processOverture);
@@ -206,12 +207,12 @@ public static void main(String[] args) throws IOException {
   }
 
   private static void printVersion() {
-    Basemap basemap = new Basemap(null, null, null, "");
+    Basemap basemap = new Basemap(null, null, null, null, "");
     System.out.println(basemap.version());
   }
 
   private static void printHelp() {
-    Basemap basemap = new Basemap(null, null, null, "");
+    Basemap basemap = new Basemap(null, null, null, null, "");
     System.out.println(String.format("""
       %s v%s
       %s
@@ -317,6 +318,16 @@ static void run(Arguments args) throws IOException {
 
     var qrankDb = QrankDb.fromCsv(qrankCsv);
 
+    Path websiteQidCsv = sourcesDir.resolve("wikidata-website-qid-2026-03.csv.gz");
+    if (!Files.exists(websiteQidCsv)) {
+      Downloader.create(planetiler.config())
+        .add("wikidata-website-qid",
+          "https://954.teczno.com/~migurski/tmp/wikidata-website-qid.csv.gz",
+          websiteQidCsv)
+        .run();
+    }
+    var websiteQidDb = WebsiteQidDb.fromCsv(websiteQidCsv);
+
     if (!Files.exists(pgfEncodingZip)) {
       Downloader.create(planetiler.config())
         .add("pgf-encoding", "https://wipfli.github.io/pgf-encoding/pgf-encoding.zip", pgfEncodingZip)
@@ -375,7 +386,7 @@ static void run(Arguments args) throws IOException {
       outputName = area;
     }
 
-    planetiler.setProfile(new Basemap(qrankDb, countryCoder, clip, layer))
+    planetiler.setProfile(new Basemap(qrankDb, websiteQidDb, countryCoder, clip, layer))
       .setOutput(Path.of(outputName + ".pmtiles"))
       .run();
   }

diff --git a/tiles/src/main/java/com/protomaps/basemap/feature/WebsiteQidDb.java b/tiles/src/main/java/com/protomaps/basemap/feature/WebsiteQidDb.java
@@ -0,0 +1,79 @@
+package com.protomaps.basemap.feature;
+
+import java.io.*;
+import java.nio.file.Path;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.zip.GZIPInputStream;
+
+/**
+ * An in-memory mapping from website domain to Wikidata Q-ID, used to enrich Overture POIs (which lack native wikidata
+ * fields) for QRank-based zoom assignment.
+ * <p>
+ * Parses a gzipped CSV with columns {@code domain,qid} into a HashMap for efficient lookup.
+ **/
+public final class WebsiteQidDb {
+
+  private final Map<String, Long> db;
+
+  public WebsiteQidDb(Map<String, Long> db) {
+    this.db = db;
+  }
+
+  /**
+   * Extracts the root domain from a URL and looks up the corresponding Wikidata Q-ID.
+   *
+   * @param url a full URL such as "https://www.iflyoak.com/flights"
+   * @return a Wikidata Q-ID string like "Q1165584", or null if not found
+   */
+  public String getQid(String url) {
+    if (url == null || url.isEmpty()) {
+      return null;
+    }
+    String domain = url.toLowerCase(java.util.Locale.ROOT); // generator lowercases URLs too
+    // Strip protocol
+    if (domain.startsWith("https://")) {
+      domain = domain.substring("https://".length());
+    } else if (domain.startsWith("http://")) {
+      domain = domain.substring("http://".length());
+    }
+    // Strip www. prefix
+    if (domain.startsWith("www.")) {
+      domain = domain.substring("www.".length());
+    }
+    // Keep the portion before the first '/' or '?', as the generator's regex does
+    int end = domain.indexOf('/');
+    int query = domain.indexOf('?');
+    end = (query >= 0 && (end < 0 || query < end)) ? query : end;
+    domain = end >= 0 ? domain.substring(0, end) : domain;
+    Long id = db.get(domain);
+    return id != null ? "Q" + id : null;
+  }
+
+  public static WebsiteQidDb fromCsv(Path csvPath) throws IOException {
+    GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(csvPath.toFile()));
+    try (BufferedReader br = new BufferedReader(new InputStreamReader(gzip))) {
+      String content;
+      Map<String, Long> db = new HashMap<>();
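+      String header = br.readLine(); // first line must be the column header
+      if (!"domain,qid".equals(header)) throw new IOException("Unexpected header in " + csvPath + ": " + header);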
+      while ((content = br.readLine()) != null) {
+        int lastComma = content.lastIndexOf(',');
+        if (lastComma < 0) {
+          continue;
+        }
+        String domain = content.substring(0, lastComma);
+        String qid = content.substring(lastComma + 1);
+        if (qid.startsWith("Q")) {
+          qid = qid.substring(1);
+        }
+        try {
+          db.put(domain, Long.parseLong(qid));
+        } catch (NumberFormatException e) {
+          // skip malformed rows
+        }
+      }
+      return new WebsiteQidDb(db);
+    }
+  }
+}
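A quick way to exercise the `domain,qid` file format without the real download is a round trip through a temporary gzipped CSV. The sketch below is self-contained and does not call `WebsiteQidDb`; the class and method names are hypothetical, and the parsing loop is a simplified variant of the one above that requires the `Q` prefix and validates the header with an explicit check, since `assert` only runs when the JVM is started with `-ea`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCsvRoundTrip {
  /** Writes a small gzipped domain,qid CSV (including one malformed row) to a temporary file. */
  static Path writeSample() throws IOException {
    Path path = Files.createTempFile("wikidata-website-qid-sample", ".csv.gz");
    try (Writer w = new OutputStreamWriter(
      new GZIPOutputStream(Files.newOutputStream(path)), StandardCharsets.UTF_8)) {
      w.write("domain,qid\noaklandzoo.org,Q2008530\nmuseumca.org,Q877714\nbroken-row\n");
    }
    return path;
  }

  /** Reads the file back into domain -> numeric Q-ID, skipping malformed rows. */
  static Map<String, Long> read(Path path) throws IOException {
    Map<String, Long> db = new HashMap<>();
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
      new GZIPInputStream(Files.newInputStream(path)), StandardCharsets.UTF_8))) {
      String header = br.readLine();
      if (!"domain,qid".equals(header)) {
        throw new IOException("Unexpected header: " + header);
      }
      String line;
      while ((line = br.readLine()) != null) {
        int comma = line.lastIndexOf(',');
        if (comma < 0 || !line.startsWith("Q", comma + 1)) {
          continue; // no separator, or the value is not a Q-prefixed ID
        }
        try {
          db.put(line.substring(0, comma), Long.parseLong(line.substring(comma + 2)));
        } catch (NumberFormatException e) {
          // skip rows whose Q-ID is not numeric
        }
      }
    }
    return db;
  }

  public static void main(String[] args) throws IOException {
    Path sample = writeSample();
    try {
      Map<String, Long> db = read(sample);
      System.out.println(db.size());                // 2
      System.out.println(db.get("oaklandzoo.org")); // 2008530
    } finally {
      Files.deleteIfExists(sample);
    }
  }
}
```

Splitting on the last comma, as the class above also does, keeps a row usable even if its domain field happens to contain a comma.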