From bf2aace836d14c809e1d7fffc8937474864bb1c6 Mon Sep 17 00:00:00 2001 From: Michal Migurski Date: Thu, 12 Mar 2026 15:07:38 -0700 Subject: [PATCH 01/10] Add WIKIDATA.md summarizing findings on Wikidata ID availability in Overture data Summarize your findings up to this point in WIKIDATA.md and commit it Co-Authored-By: Claude Sonnet 4.6 --- tiles/WIKIDATA.md | 48 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+) create mode 100644 tiles/WIKIDATA.md diff --git a/tiles/WIKIDATA.md b/tiles/WIKIDATA.md new file mode 100644 index 00000000..3788a2c7 --- /dev/null +++ b/tiles/WIKIDATA.md @@ -0,0 +1,48 @@ +# Wikidata ID Availability in Overture Data + +Investigation into whether Overture POIs can be linked to Wikidata for richer rendering. + +## Data sources examined + +`Oakland-visualtests.parquet` — Overture 2026-02-18 release, Bay Area coverage. + +## Findings by theme + +### `places` theme (618,880 features) + +- **Top-level `wikidata` field**: entirely null — 0 out of 618,880 features have a value. +- **`brand.wikidata`**: present for ~11,059 features — chain businesses where the *brand entity* has a Wikidata ID (e.g. `Q177054` for Burger King). Does not help with unique/famous places. +- **Unique/famous places** (Oakland Zoo, Oakland Museum of California, Oakland International Airport, UC Berkeley, etc.): no wikidata at all, neither top-level nor brand. + +### `divisions` theme (3,615 features) + +- **Top-level `wikidata` field**: populated for cities, counties, etc. (e.g. `Q62` for San Francisco, `Q927122` for South San Francisco). +- These feed the `places` map layer via `Places.java`, which already exports the `wikidata` attribute to output tiles (lines 324–325 and 431). + +## Alternative: website domain matching + +Overture places features often include `websites` and `socials` arrays. + +- **`socials`**: contains Facebook URLs with numeric page IDs (e.g. `facebook.com/353030440227`). 
Wikidata stores Facebook *usernames* (P2013), not numeric IDs — no direct join possible. +- **`websites`**: contains place-specific URLs. Extracting the root domain and matching against Wikidata P856 (official website) works: + - `oaklandzoo.org` → Q2008530 (Oakland Zoo) ✅ + - `museumca.org` → Q877714 (Oakland Museum of California) ✅ + - `oaklandairport.com` / `flyoakland.com` → Oakland International Airport ✅ + +### Scale + +- ~303,848 unique meaningful domains across all places features (excluding generic social/link domains). +- Too large to bulk-query Wikidata for all places. +- Feasible as a **pre-built lookup table** scoped to notable categories: `zoo`, `museum`, `airport`, `stadium`, `aquarium`, `university`, `library` — estimated ~5–10k distinct domains. + +### Caveats + +- Multiple Overture features can share the same domain (e.g. all Oakland Public Library branches → `oaklandlibrary.org`), so the match links to the organization entity, not a specific location. +- No coverage for places without websites (~half of all features). + +## Recommended next step + +Build a one-time lookup table by: +1. Extracting root domains from `websites` for notable-category features +2. Batching SPARQL queries to Wikidata P856 to retrieve QIDs +3. 
Joining back to Overture IDs and storing as a small CSV/parquet for use in tile generation From c0c664021e354a8d0b526d53138788fb310601a1 Mon Sep 17 00:00:00 2001 From: Michal Migurski Date: Thu, 12 Mar 2026 15:16:54 -0700 Subject: [PATCH 02/10] Update WIKIDATA.md with QLever bulk export approach and match rate findings Summarize this exploration into WIKIDATA.md and commit Co-Authored-By: Claude Sonnet 4.6 --- tiles/WIKIDATA.md | 48 ++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 41 insertions(+), 7 deletions(-) diff --git a/tiles/WIKIDATA.md b/tiles/WIKIDATA.md index 3788a2c7..d3d73585 100644 --- a/tiles/WIKIDATA.md +++ b/tiles/WIKIDATA.md @@ -32,17 +32,51 @@ Overture places features often include `websites` and `socials` arrays. ### Scale - ~303,848 unique meaningful domains across all places features (excluding generic social/link domains). -- Too large to bulk-query Wikidata for all places. -- Feasible as a **pre-built lookup table** scoped to notable categories: `zoo`, `museum`, `airport`, `stadium`, `aquarium`, `university`, `library` — estimated ~5–10k distinct domains. +- Too large to bulk-query the Wikidata SPARQL endpoint (60-second hard timeout). ### Caveats - Multiple Overture features can share the same domain (e.g. all Oakland Public Library branches → `oaklandlibrary.org`), so the match links to the organization entity, not a specific location. +- Some domains map to multiple QIDs (e.g. `museumca.org` → Q877714, Q133252684, Q30672317) — needs disambiguation. - No coverage for places without websites (~half of all features). -## Recommended next step +## Bulk Wikidata P856 export via QLever -Build a one-time lookup table by: -1. Extracting root domains from `websites` for notable-category features -2. Batching SPARQL queries to Wikidata P856 to retrieve QIDs -3. Joining back to Overture IDs and storing as a small CSV/parquet for use in tile generation +The Wikidata SPARQL endpoint times out on full P856 scans. 
[QLever](https://qlever.dev/wikidata) (University of Freiburg) is a faster alternative engine that handles full-dataset scans. The complete P856 table (2.3M rows) was fetched in one query in ~16 seconds: + +``` +PREFIX wdt: <http://www.wikidata.org/prop/direct/> +SELECT ?item ?website WHERE { ?item wdt:P856 ?website } +``` + +```sh +curl -H "Accept: text/tab-separated-values" \ + --data-urlencode "query=PREFIX wdt: <http://www.wikidata.org/prop/direct/> SELECT ?item ?website WHERE { ?item wdt:P856 ?website }" \ + --data-urlencode "send=2400000" \ + "https://qlever.dev/api/wikidata" \ + -o data/sources/wikidata-p856.tsv +``` + +The TSV was then parsed to extract QID and root domain, and saved as `data/sources/wikidata-p856.parquet` (36 MB, 2.3M rows). + +## Match rate against Overture Oakland data + +Joining `wikidata-p856.parquet` against `Oakland-visualtests.parquet` on root domain (excluding generic domains like facebook.com, yelp.com, etc.): + +- **522,799** places have a usable website URL +- **121,685** (23.3%) matched to at least one Wikidata QID + +Confirmed matches for notable places: +- `oaklandzoo.org` → Q2008530 (Oakland Zoo) ✅ +- `museumca.org` → Q877714 (Oakland Museum of California) ✅ +- `iflyoak.com` → Q1165584 (Oakland International Airport) ✅ +- `berkeley.edu` → Q168756 (UC Berkeley) ✅ + +Top matched categories: doctor, park, government association, medical center, hotel, university, library, landmark. + +## Remaining open question + +When a domain maps to multiple QIDs, which to prefer? 
Options: +- The QID whose P856 URL most closely matches the full Overture URL (not just domain) +- The QID with the most Wikidata statements (a proxy for "most notable") +- The QID that is an instance of a place type matching the Overture category From 423144d22983a4a6743a737990730b4085906ba4 Mon Sep 17 00:00:00 2001 From: Michal Migurski Date: Thu, 12 Mar 2026 15:38:07 -0700 Subject: [PATCH 03/10] Update WIKIDATA.md with disambiguation findings; lowest Q-number heuristic wins Compare and contrast your two proposed disambiguation approaches; Try that combined approach Co-Authored-By: Claude Sonnet 4.6 --- tiles/WIKIDATA.md | 32 +++++++++++++++++++++++++++----- 1 file changed, 27 insertions(+), 5 deletions(-) diff --git a/tiles/WIKIDATA.md b/tiles/WIKIDATA.md index d3d73585..4ec8204b 100644 --- a/tiles/WIKIDATA.md +++ b/tiles/WIKIDATA.md @@ -74,9 +74,31 @@ Confirmed matches for notable places: Top matched categories: doctor, park, government association, medical center, hotel, university, library, landmark. -## Remaining open question +## Disambiguation: multiple QIDs per domain -When a domain maps to multiple QIDs, which to prefer? Options: -- The QID whose P856 URL most closely matches the full Overture URL (not just domain) -- The QID with the most Wikidata statements (a proxy for "most notable") -- The QID that is an instance of a place type matching the Overture category +142,182 domains map to more than one QID in the P856 table (avg 6.7, max 81,822 for a digital library collection domain where every digitized item has its own Wikidata entry). + +### Approaches considered + +**Full URL path matching** — prefer the QID whose stored P856 URL most closely matches the full Overture URL, not just the domain. Ruled out: Wikidata typically stores bare root URLs (`http://www.oaklandzoo.org`), so this rarely breaks ties. 
+ +**P31 instance-of type matching** — fetch P31 (instance of) for all candidate QIDs and prefer the one whose type aligns with the Overture category (e.g. Overture `museum` → prefer QID with `instance of: Q33506`). Ruled out: P31 has 122M rows in Wikidata; QLever serves at most ~15M rows per query and the full download fails. Batching 920k candidate QIDs via the SPARQL endpoint would be slow and fragile. + +**Lowest Q-number tiebreak** — prefer the QID with the smallest numeric value. This works because Wikidata assigns lower Q-numbers to older, more established entities. Exhibitions, digitized sub-items, and branch locations all post-date their parent organizations and receive higher Q-numbers. + +### Result + +The lowest Q-number heuristic gets the right answer in all tested cases: + +| Domain | Winner QID | Entity | +|---|---|---| +| `museumca.org` | Q877714 | Oakland Museum of California (not the exhibitions Q133252684, Q30672317) | +| `oaklandzoo.org` | Q2008530 | Oakland Zoo | +| `iflyoak.com` | Q1165584 | Oakland International Airport | +| `berkeley.edu` | Q168756 | UC Berkeley | +| `oaklandlibrary.org` | Q1090829 | Oakland Public Library (not individual branches) | +| `bart.gov` | Q250113 | Bay Area Rapid Transit | + +### Output + +`data/sources/wikidata-domain-qid.parquet` — 1,432,271 domain → QID mappings, 30 MB. Built by grouping `wikidata-p856.parquet` by domain and taking the minimum Q-number per domain. From 9ec51667a6ed83ade0b92c0bc9004c62d20f71af Mon Sep 17 00:00:00 2001 From: Michal Migurski Date: Thu, 12 Mar 2026 15:49:08 -0700 Subject: [PATCH 04/10] Update WIKIDATA.md with integration design: distribution format and runtime lookup chain Update WIKIDATA.md with these findings. 
Co-Authored-By: Claude Sonnet 4.6 --- tiles/WIKIDATA.md | 42 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) diff --git a/tiles/WIKIDATA.md b/tiles/WIKIDATA.md index 4ec8204b..9edded56 100644 --- a/tiles/WIKIDATA.md +++ b/tiles/WIKIDATA.md @@ -102,3 +102,45 @@ The lowest Q-number heuristic gets the right answer in all tested cases: ### Output `data/sources/wikidata-domain-qid.parquet` — 1,432,271 domain → QID mappings, 30 MB. Built by grouping `wikidata-p856.parquet` by domain and taking the minimum Q-number per domain. + +## Integration with the tile build pipeline + +### Distribution format + +The final file should be published as a dated gzipped two-column CSV, e.g.: + +``` +wikidata-website-qid-2026-03.csv.gz +domain,qid +oaklandzoo.org,Q2008530 +museumca.org,Q877714 +... +``` + +This mirrors the format and hosting pattern of `qrank.csv.gz` (from `qrank.toolforge.org`), but hosted on `r2-public.protomaps.com` since this is a derived file we generate ourselves. A dated URL (like `Overture-QRank-2025-12-17.parquet`) makes renders reproducible. It would be regenerated periodically by re-running the QLever P856 query and rebuilding. + +### Runtime lookup (two-hop via QRank) + +At render time the file is downloaded once into the sources directory if not present, then loaded into a `WebsiteQidDb` (a new class modeled on `QrankDb`) as a `HashMap` — domain string → numeric Q-ID. + +In `Pois.processOverture`, the website lookup acts as a fallback that fills in a wikidata ID when the feature doesn't have one natively, and then the existing QRank machinery takes over unchanged: + +```java +String wikidata = sf.getString("wikidata"); // always null for Overture places theme +if (wikidata == null) { + String website = /* first entry from sf.getList("websites") */; + wikidata = websiteQidDb.getQid(website); // domain → "Q2008530" +} +long qrank = (wikidata != null) ? 
qrankDb.get(wikidata) : 0; +var qrankedZoom = QrankDb.assignZoom(qrankGrading, kind, qrank); +``` + +The full lookup chain is: + +``` +sf.websites[0] → domain → Q-ID (WebsiteQidDb) + Q-ID → qrank score (QrankDb) + qrank score → minZoom (assignZoom) +``` + +No changes are needed to `QrankDb` or `assignZoom` — `QrankDb.get(long)` already accepts a numeric ID. A place only benefits if it has a matching website entry *and* a QRank score; otherwise `qrank = 0` and behavior is identical to today. From cbcef9532f54747c2397f7062a5ccb25484685ab Mon Sep 17 00:00:00 2001 From: Michal Migurski Date: Thu, 12 Mar 2026 15:55:14 -0700 Subject: [PATCH 05/10] Add generate-wikidata-website-qid.sh; update WIKIDATA.md Propose a script that will generate a fresh copy of wikidata-website-qid.csv.gz when it is run on a schedule; yes, and if it does update WIKIDATA.md and commit both Co-Authored-By: Claude Sonnet 4.6 --- generate-wikidata-website-qid.sh | 43 ++++++++++++++++++++++++++++++++ tiles/WIKIDATA.md | 2 ++ 2 files changed, 45 insertions(+) create mode 100755 generate-wikidata-website-qid.sh diff --git a/generate-wikidata-website-qid.sh b/generate-wikidata-website-qid.sh new file mode 100755 index 00000000..754ed5a8 --- /dev/null +++ b/generate-wikidata-website-qid.sh @@ -0,0 +1,43 @@ +#!/bin/bash -ex + +# Generate wikidata-website-qid.csv.gz -- a mapping from website domain to Wikidata QID. +# +# Fetches the complete Wikidata P856 (official website) table from QLever, +# extracts root domains, disambiguates multiple QIDs per domain by taking the +# lowest Q-number, and writes a gzipped two-column CSV. +# +# Output: wikidata-website-qid-YYYY-MM.csv.gz +# Usage: ./generate-wikidata-website-qid.sh + +DATE=$(date +%Y-%m) +OUTPUT="wikidata-website-qid-${DATE}.csv.gz" +TSV_TMP=$(mktemp /tmp/wikidata-p856-XXXXXX) && mv "$TSV_TMP" "${TSV_TMP}.tsv" && TSV_TMP="${TSV_TMP}.tsv" + +echo "Fetching Wikidata P856 (official website) from QLever..." 
+curl \ + -H "Accept: text/tab-separated-values" \ + --data-urlencode "query=PREFIX wdt: <http://www.wikidata.org/prop/direct/> SELECT ?item ?website WHERE { ?item wdt:P856 ?website }" \ + --data-urlencode "send=2400000" \ + "https://qlever.dev/api/wikidata" \ + -o "$TSV_TMP" + +echo "Building domain -> QID mapping..." +duckdb -c " +COPY ( + SELECT + regexp_extract(lower(\"?website\"), 'https?://(?:www\\.)?([^/>\?]+)', 1) AS domain, + arg_min( + regexp_extract(\"?item\", 'entity/(Q[0-9]+)', 1), + CAST(regexp_extract(\"?item\", 'Q([0-9]+)', 1) AS INTEGER) + ) AS qid + FROM read_csv('${TSV_TMP}', delim='\t', header=true, ignore_errors=true) + WHERE regexp_extract(\"?item\", 'entity/(Q[0-9]+)', 1) != '' + AND regexp_extract(lower(\"?website\"), 'https?://(?:www\\.)?([^/>\?]+)', 1) != '' + GROUP BY domain + ORDER BY domain +) TO '/dev/stdout' (FORMAT CSV, HEADER true) +" | gzip > "$OUTPUT" + +rm "$TSV_TMP" + +echo "Done: ${OUTPUT} ($(du -sh "$OUTPUT" | cut -f1))" diff --git a/tiles/WIKIDATA.md b/tiles/WIKIDATA.md index 9edded56..e3494e32 100644 --- a/tiles/WIKIDATA.md +++ b/tiles/WIKIDATA.md @@ -119,6 +119,8 @@ museumca.org,Q877714 This mirrors the format and hosting pattern of `qrank.csv.gz` (from `qrank.toolforge.org`), but hosted on `r2-public.protomaps.com` since this is a derived file we generate ourselves. A dated URL (like `Overture-QRank-2025-12-17.parquet`) makes renders reproducible. It would be regenerated periodically by re-running the QLever P856 query and rebuilding. +The file is generated by `generate-wikidata-website-qid.sh` at the repo root. The script fetches the full P856 table from QLever, extracts root domains, disambiguates via lowest Q-number, and writes the gzipped CSV — producing ~1.4M rows at ~15 MB in under 30 seconds. + ### Runtime lookup (two-hop via QRank) At render time the file is downloaded once into the sources directory if not present, then loaded into a `WebsiteQidDb` (a new class modeled on `QrankDb`) as a `HashMap` — domain string → numeric Q-ID. 
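The domain extraction and lowest-Q-number tiebreak that the script performs in DuckDB can also be sketched in plain Java, which may help when checking individual rows by hand. This is a minimal sketch under stated assumptions, not part of the patch series: the class `DomainQidSketch` and its helpers `rootDomain`, `qNumber`, and `lowestQidPerDomain` are hypothetical names, and the sample rows assume that the TSV carries angle-bracketed IRIs. The regular expression is copied from the script; the three `museumca.org` QIDs are the ones listed in WIKIDATA.md.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainQidSketch {

  // Same pattern the shell script passes to DuckDB's regexp_extract.
  private static final Pattern HOST = Pattern.compile("https?://(?:www\\.)?([^/>?]+)");

  // Hypothetical helper: lowercases the URL and returns the host without "www.", or null.
  static String rootDomain(String url) {
    if (url == null) {
      return null;
    }
    Matcher m = HOST.matcher(url.toLowerCase(Locale.ROOT));
    return m.find() ? m.group(1) : null;
  }

  // Hypothetical helper: numeric part of ".../entity/Q123", or -1 if absent.
  static long qNumber(String item) {
    int i = item.indexOf("entity/Q");
    if (i < 0) {
      return -1;
    }
    int start = i + "entity/Q".length();
    int end = start;
    while (end < item.length() && Character.isDigit(item.charAt(end))) {
      end++;
    }
    return end > start ? Long.parseLong(item.substring(start, end)) : -1;
  }

  // Groups (item, website) rows by domain and keeps the smallest Q-number per domain.
  static Map<String, Long> lowestQidPerDomain(List<String[]> rows) {
    Map<String, Long> best = new HashMap<>();
    for (String[] row : rows) {
      long q = qNumber(row[0]);
      String domain = rootDomain(row[1]);
      if (q < 0 || domain == null) {
        continue; // skip rows the script's WHERE clause would also drop
      }
      best.merge(domain, q, Long::min);
    }
    return best;
  }

  public static void main(String[] args) {
    // Assumed row shape: item IRI, website IRI, both in angle brackets.
    List<String[]> rows = List.of(
      new String[] {"<http://www.wikidata.org/entity/Q133252684>", "<https://museumca.org/>"},
      new String[] {"<http://www.wikidata.org/entity/Q877714>", "<http://www.museumca.org>"},
      new String[] {"<http://www.wikidata.org/entity/Q30672317>", "<https://museumca.org/>"});
    Map<String, Long> best = lowestQidPerDomain(rows);
    System.out.println("Q" + best.get("museumca.org")); // prints Q877714
  }
}
```

Using `Map.merge` with `Long::min` keeps the reduction to a single pass, which corresponds to the `GROUP BY domain` plus `arg_min` step in the script.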
From 74edd5c814b48206a8d2ee1204effaf1c63c7f38 Mon Sep 17 00:00:00 2001 From: Michal Migurski Date: Thu, 12 Mar 2026 15:56:30 -0700 Subject: [PATCH 06/10] Move generate-wikidata-website-qid.sh to tiles/; output to data/sources/ Hold on, let's have scripts live here in tiles/ and resulting data live under data/sources/ with others Co-Authored-By: Claude Sonnet 4.6 --- .../generate-wikidata-website-qid.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) rename generate-wikidata-website-qid.sh => tiles/generate-wikidata-website-qid.sh (92%) diff --git a/generate-wikidata-website-qid.sh b/tiles/generate-wikidata-website-qid.sh similarity index 92% rename from generate-wikidata-website-qid.sh rename to tiles/generate-wikidata-website-qid.sh index 754ed5a8..7741557b 100755 --- a/generate-wikidata-website-qid.sh +++ b/tiles/generate-wikidata-website-qid.sh @@ -6,11 +6,11 @@ # extracts root domains, disambiguates multiple QIDs per domain by taking the # lowest Q-number, and writes a gzipped two-column CSV. # -# Output: wikidata-website-qid-YYYY-MM.csv.gz +# Output: data/sources/wikidata-website-qid-YYYY-MM.csv.gz # Usage: ./generate-wikidata-website-qid.sh DATE=$(date +%Y-%m) -OUTPUT="wikidata-website-qid-${DATE}.csv.gz" +OUTPUT="data/sources/wikidata-website-qid-${DATE}.csv.gz" TSV_TMP=$(mktemp /tmp/wikidata-p856-XXXXXX) && mv "$TSV_TMP" "${TSV_TMP}.tsv" && TSV_TMP="${TSV_TMP}.tsv" echo "Fetching Wikidata P856 (official website) from QLever..." From 7b20fe295cf5908ca8aa522f3b2486615e924f34 Mon Sep 17 00:00:00 2001 From: Michal Migurski Date: Thu, 12 Mar 2026 16:41:29 -0700 Subject: [PATCH 07/10] Implement WebsiteQidDb + QRank-based Overture POI zoom MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add WebsiteQidDb: domain→QID lookup parsed from a gzipped CSV (wikidata-website-qid-2026-03.csv.gz). Overture places features have no native wikidata field, but often carry websites URLs. 
This enables a two-hop lookup: websites[0] → domain → Q-ID → QRank score → min_zoom. - WebsiteQidDb.java: HashMap backed, fromCsv uses lastIndexOf(',') to handle domain values containing commas; getQid() strips protocol/www/path before lookup - Basemap.java: download + load websiteQidDb after qrankDb; pass to Pois - Pois.java: add websiteQidDb field; fallback website→QID lookup in processOverture when wikidata tag is absent; add zoo/college/museum qrankGrading entries; recalibrate aerodrome/university thresholds so Oakland Airport→zoom 11, Oakland Zoo→zoom 12, UCB→zoom 13, OMCA→zoom 14 - Tests: WebsiteQidDbTest (9 tests), 4 new PoisOvertureTest cases with real Overture UUIDs (f66024a2 airport, a74a40ae zoo, 67e4f788 UCB, 474b271e OMCA), LayerTest fixture expanded with all four Q-IDs Prompt: "Implement the following plan: WebsiteQidDb + QRank-based Overture POI Zoom [...] when you add unit tests concerning Overture features, always include their full UUID so we can trace them back to the original dataset [...] 
just use CLI duckdb, we already have it" Co-Authored-By: Claude Sonnet 4.6 --- .../java/com/protomaps/basemap/Basemap.java | 21 +++-- .../basemap/feature/WebsiteQidDb.java | 79 ++++++++++++++++++ .../com/protomaps/basemap/layers/Pois.java | 20 ++++- .../basemap/feature/WebsiteQidDbTest.java | 73 ++++++++++++++++ .../protomaps/basemap/layers/LayerTest.java | 15 +++- .../protomaps/basemap/layers/PoisTest.java | 71 +++++++++++++++- .../test/resources/website_qid_fixture.csv.gz | Bin 0 -> 96 bytes 7 files changed, 266 insertions(+), 13 deletions(-) create mode 100644 tiles/src/main/java/com/protomaps/basemap/feature/WebsiteQidDb.java create mode 100644 tiles/src/test/java/com/protomaps/basemap/feature/WebsiteQidDbTest.java create mode 100644 tiles/src/test/resources/website_qid_fixture.csv.gz diff --git a/tiles/src/main/java/com/protomaps/basemap/Basemap.java b/tiles/src/main/java/com/protomaps/basemap/Basemap.java index bc470a57..b62bdc94 100644 --- a/tiles/src/main/java/com/protomaps/basemap/Basemap.java +++ b/tiles/src/main/java/com/protomaps/basemap/Basemap.java @@ -8,6 +8,7 @@ import com.onthegomap.planetiler.util.Downloader; import com.protomaps.basemap.feature.CountryCoder; import com.protomaps.basemap.feature.QrankDb; +import com.protomaps.basemap.feature.WebsiteQidDb; import com.protomaps.basemap.layers.Boundaries; import com.protomaps.basemap.layers.Buildings; import com.protomaps.basemap.layers.Earth; @@ -38,7 +39,7 @@ public class Basemap extends ForwardingProfile { private static final Logger LOGGER = LoggerFactory.getLogger(Basemap.class); - public Basemap(QrankDb qrankDb, CountryCoder countryCoder, Clip clip, + public Basemap(QrankDb qrankDb, WebsiteQidDb websiteQidDb, CountryCoder countryCoder, Clip clip, String layer) { if (layer.isEmpty() || layer.equals(Boundaries.LAYER_NAME)) { @@ -78,7 +79,7 @@ public Basemap(QrankDb qrankDb, CountryCoder countryCoder, Clip clip, } if (layer.isEmpty() || layer.equals(Pois.LAYER_NAME)) { - var poi = new 
Pois(qrankDb); + var poi = new Pois(qrankDb, websiteQidDb); registerHandler(poi); registerSourceHandler("osm", poi::processOsm); registerSourceHandler("pm:overture", poi::processOverture); @@ -206,12 +207,12 @@ public static void main(String[] args) throws IOException { } private static void printVersion() { - Basemap basemap = new Basemap(null, null, null, ""); + Basemap basemap = new Basemap(null, null, null, null, ""); System.out.println(basemap.version()); } private static void printHelp() { - Basemap basemap = new Basemap(null, null, null, ""); + Basemap basemap = new Basemap(null, null, null, null, ""); System.out.println(String.format(""" %s v%s %s @@ -317,6 +318,16 @@ static void run(Arguments args) throws IOException { var qrankDb = QrankDb.fromCsv(qrankCsv); + Path websiteQidCsv = sourcesDir.resolve("wikidata-website-qid-2026-03.csv.gz"); + if (!Files.exists(websiteQidCsv)) { + Downloader.create(planetiler.config()) + .add("wikidata-website-qid", + "https://954.teczno.com/~migurski/tmp/wikidata-website-qid.csv.gz", + websiteQidCsv) + .run(); + } + var websiteQidDb = WebsiteQidDb.fromCsv(websiteQidCsv); + if (!Files.exists(pgfEncodingZip)) { Downloader.create(planetiler.config()) .add("pgf-encoding", "https://wipfli.github.io/pgf-encoding/pgf-encoding.zip", pgfEncodingZip) @@ -375,7 +386,7 @@ static void run(Arguments args) throws IOException { outputName = area; } - planetiler.setProfile(new Basemap(qrankDb, countryCoder, clip, layer)) + planetiler.setProfile(new Basemap(qrankDb, websiteQidDb, countryCoder, clip, layer)) .setOutput(Path.of(outputName + ".pmtiles")) .run(); } diff --git a/tiles/src/main/java/com/protomaps/basemap/feature/WebsiteQidDb.java b/tiles/src/main/java/com/protomaps/basemap/feature/WebsiteQidDb.java new file mode 100644 index 00000000..96c9586e --- /dev/null +++ b/tiles/src/main/java/com/protomaps/basemap/feature/WebsiteQidDb.java @@ -0,0 +1,79 @@ +package com.protomaps.basemap.feature; + +import java.io.*; +import 
java.nio.file.Path; +import java.util.HashMap; +import java.util.Map; +import java.util.zip.GZIPInputStream; + +/** + * An in-memory mapping from website domain to Wikidata Q-ID, used to enrich Overture POIs (which lack native wikidata + * fields) for QRank-based zoom assignment. + *

 + * Parses a gzipped CSV with columns {@code domain,qid} into a HashMap for efficient lookup. + **/ +public final class WebsiteQidDb { + + private final Map<String, Long> db; + + public WebsiteQidDb(Map<String, Long> db) { + this.db = db; + } + + /** + * Extracts the root domain from a URL and looks up the corresponding Wikidata Q-ID. + * + * @param url a full URL such as "https://www.iflyoak.com/flights" + * @return a Wikidata Q-ID string like "Q1165584", or null if not found + */ + public String getQid(String url) { + if (url == null || url.isEmpty()) { + return null; + } + String domain = url; + // Strip protocol + if (domain.startsWith("https://")) { + domain = domain.substring("https://".length()); + } else if (domain.startsWith("http://")) { + domain = domain.substring("http://".length()); + } + // Strip www. prefix + if (domain.startsWith("www.")) { + domain = domain.substring("www.".length()); + } + // Take portion up to first / + int slash = domain.indexOf('/'); + if (slash >= 0) { + domain = domain.substring(0, slash); + } + Long id = db.get(domain); + return id != null ? 
"Q" + id : null; + } + + public static WebsiteQidDb fromCsv(Path csvPath) throws IOException { + GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(csvPath.toFile())); + try (BufferedReader br = new BufferedReader(new InputStreamReader(gzip))) { + String content; + Map db = new HashMap<>(); + String header = br.readLine(); // header + assert (header.equals("domain,qid")); + while ((content = br.readLine()) != null) { + int lastComma = content.lastIndexOf(','); + if (lastComma < 0) { + continue; + } + String domain = content.substring(0, lastComma); + String qid = content.substring(lastComma + 1); + if (qid.startsWith("Q")) { + qid = qid.substring(1); + } + try { + db.put(domain, Long.parseLong(qid)); + } catch (NumberFormatException e) { + // skip malformed rows + } + } + return new WebsiteQidDb(db); + } + } +} diff --git a/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java b/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java index edba20bf..529087f5 100644 --- a/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java +++ b/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java @@ -22,6 +22,7 @@ import com.protomaps.basemap.feature.FeatureId; import com.protomaps.basemap.feature.Matcher; import com.protomaps.basemap.feature.QrankDb; +import com.protomaps.basemap.feature.WebsiteQidDb; import com.protomaps.basemap.names.OsmNames; import java.util.List; import java.util.Map; @@ -31,17 +32,22 @@ public class Pois implements ForwardingProfile.LayerPostProcessor { private Map qrankGrading = Map.of( "station", new int[][]{{10, 50000}, {12, 20000}, {13, 10000}}, - "aerodrome", new int[][]{{10, 50000}, {12, 20000}, {13, 5000}, {14, 2500}}, + "aerodrome", new int[][]{{10, 200000}, {11, 100000}, {12, 20000}, {13, 5000}, {14, 2500}}, "park", new int[][]{{11, 20000}, {12, 10000}, {13, 5000}, {14, 2500}}, "peak", new int[][]{{11, 20000}, {12, 10000}, {13, 5000}, {14, 2500}}, - "attraction", new int[][]{{12, 40000}, {13, 20000}, {14, 10000}}, - 
"university", new int[][]{{12, 40000}, {13, 20000}, {14, 10000}} + "attraction", new int[][]{{12, 40000}, {13, 20000}, {14, 5000}}, + "university", new int[][]{{12, 2000000}, {13, 500000}, {14, 10000}}, + "college", new int[][]{{12, 2000000}, {13, 500000}, {14, 10000}}, + "zoo", new int[][]{{12, 10000}, {13, 5000}, {14, 2500}}, + "museum", new int[][]{{13, 20000}, {14, 5000}} ); private QrankDb qrankDb; + private WebsiteQidDb websiteQidDb; - public Pois(QrankDb qrankDb) { + public Pois(QrankDb qrankDb, WebsiteQidDb websiteQidDb) { this.qrankDb = qrankDb; + this.websiteQidDb = websiteQidDb; } public static final String LAYER_NAME = "pois"; @@ -564,6 +570,12 @@ public void processOverture(SourceFeature sf, FeatureCollector features) { // QRank may override minZoom entirely String wikidata = sf.getString("wikidata"); + if (wikidata == null && websiteQidDb != null) { + Object websitesObj = sf.getTag("websites"); + if (websitesObj instanceof List websites && !((List) websites).isEmpty()) { + wikidata = websiteQidDb.getQid(websites.get(0).toString()); + } + } long qrank = (wikidata != null) ? 
qrankDb.get(wikidata) : 0; var qrankedZoom = QrankDb.assignZoom(qrankGrading, kind, qrank); diff --git a/tiles/src/test/java/com/protomaps/basemap/feature/WebsiteQidDbTest.java b/tiles/src/test/java/com/protomaps/basemap/feature/WebsiteQidDbTest.java new file mode 100644 index 00000000..d28a870e --- /dev/null +++ b/tiles/src/test/java/com/protomaps/basemap/feature/WebsiteQidDbTest.java @@ -0,0 +1,73 @@ +package com.protomaps.basemap.feature; + +import static org.junit.jupiter.api.Assertions.*; + +import java.io.IOException; +import java.net.URISyntaxException; +import java.nio.file.Path; +import java.util.Map; +import org.junit.jupiter.api.Test; + +class WebsiteQidDbTest { + + private WebsiteQidDb dbFromFixture() throws IOException, URISyntaxException { + var resource = getClass().getClassLoader().getResource("website_qid_fixture.csv.gz"); + assertNotNull(resource, "Test fixture not found: website_qid_fixture.csv.gz"); + return WebsiteQidDb.fromCsv(Path.of(resource.toURI())); + } + + @Test + void parsesFixtureCsv() throws IOException, URISyntaxException { + var db = dbFromFixture(); + assertEquals("Q2008530", db.getQid("http://www.oaklandzoo.org/")); + assertEquals("Q877714", db.getQid("https://museumca.org/")); + } + + @Test + void stripsHttps() throws IOException, URISyntaxException { + var db = dbFromFixture(); + assertEquals("Q2008530", db.getQid("https://oaklandzoo.org/")); + } + + @Test + void stripsHttp() throws IOException, URISyntaxException { + var db = dbFromFixture(); + assertEquals("Q2008530", db.getQid("http://oaklandzoo.org/")); + } + + @Test + void stripsWww() throws IOException, URISyntaxException { + var db = dbFromFixture(); + assertEquals("Q2008530", db.getQid("http://www.oaklandzoo.org/")); + } + + @Test + void stripsPath() throws IOException, URISyntaxException { + var db = dbFromFixture(); + assertEquals("Q877714", db.getQid("https://museumca.org/visit/hours")); + } + + @Test + void missingDomainReturnsNull() throws IOException, 
URISyntaxException { + var db = dbFromFixture(); + assertNull(db.getQid("https://example.com/")); + } + + @Test + void nullUrlReturnsNull() { + var db = new WebsiteQidDb(Map.of()); + assertNull(db.getQid(null)); + } + + @Test + void emptyUrlReturnsNull() { + var db = new WebsiteQidDb(Map.of()); + assertNull(db.getQid("")); + } + + @Test + void inMemoryConstructor() { + var db = new WebsiteQidDb(Map.of("iflyoak.com", 1165584L)); + assertEquals("Q1165584", db.getQid("https://www.iflyoak.com/flights")); + } +} diff --git a/tiles/src/test/java/com/protomaps/basemap/layers/LayerTest.java b/tiles/src/test/java/com/protomaps/basemap/layers/LayerTest.java index a376eb85..dbe5cc7c 100644 --- a/tiles/src/test/java/com/protomaps/basemap/layers/LayerTest.java +++ b/tiles/src/test/java/com/protomaps/basemap/layers/LayerTest.java @@ -12,6 +12,7 @@ import com.protomaps.basemap.Basemap; import com.protomaps.basemap.feature.CountryCoder; import com.protomaps.basemap.feature.QrankDb; +import com.protomaps.basemap.feature.WebsiteQidDb; import java.util.List; import java.util.Map; import java.util.stream.StreamSupport; @@ -26,9 +27,19 @@ abstract class LayerTest { "{\"type\":\"FeatureCollection\",\"features\":[{\"type\":\"Feature\",\"properties\":{\"iso1A2\":\"US\",\"nameEn\":\"United States\"},\"geometry\":{\"type\":\"MultiPolygon\",\"coordinates\":[[[[-124,47],[-124,25],[-71,25],[-71,47],[-124,47]]]]}}]}"); - final QrankDb qrankDb = new QrankDb(LongLongHashMap.from(new long[]{8888}, new long[]{100000})); + final QrankDb qrankDb = new QrankDb(LongLongHashMap.from( + new long[]{8888, 1165584, 2008530, 168756, 877714}, + new long[]{100000, 140740, 12197, 1604223, 9227} + )); - final Basemap profile = new Basemap(qrankDb, countryCoder, null, ""); + final WebsiteQidDb websiteQidDb = new WebsiteQidDb(Map.of( + "iflyoak.com", 1165584L, // Oakland Airport Q1165584 + "oaklandzoo.org", 2008530L, // Oakland Zoo Q2008530 + "berkeley.edu", 168756L, // UC Berkeley Q168756 + "museumca.org", 
877714L // OMCA Q877714 + )); + + final Basemap profile = new Basemap(qrankDb, websiteQidDb, countryCoder, null, ""); static void assertFeatures(int zoom, List> expected, Iterable actual) { var expectedList = expected.stream().toList(); diff --git a/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java b/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java index 860216bb..37f82a36 100644 --- a/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java +++ b/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java @@ -3,6 +3,7 @@ import static com.onthegomap.planetiler.TestUtils.*; import com.onthegomap.planetiler.reader.SimpleFeature; +import java.util.ArrayList; import java.util.HashMap; import java.util.List; import java.util.Map; @@ -92,9 +93,10 @@ void playground() { @Test void withQrank() { + // Q8888 has QRank=100000; aerodrome grading {10,200000},{11,100000},... → minZoom=11 → min_zoom=12 assertFeatures(11, List.of( - Map.of("kind", "aerodrome", "name", "SFO", "min_zoom", 11)), + Map.of("kind", "aerodrome", "name", "SFO", "min_zoom", 12)), process(SimpleFeature.create( newPoint(0, 0), new HashMap<>(Map.of("aeroway", "aerodrome", "name", "SFO", "wikidata", "Q8888")), @@ -106,9 +108,10 @@ void withQrank() { @Test void withQrankPoly() { + // Q8888 has QRank=100000; aerodrome grading {10,200000},{11,100000},... 
→ minZoom=11 → min_zoom=12 assertFeatures(11, List.of(Map.of("kind", "aerodrome"), - Map.of("kind", "aerodrome", "name", "SFO", "min_zoom", 11)), + Map.of("kind", "aerodrome", "name", "SFO", "min_zoom", 12)), process(SimpleFeature.create( newPolygon(0, 0, 0, 1, 1, 1, 1, 0, 0, 0), new HashMap<>(Map.of("aeroway", "aerodrome", "name", "SFO", "wikidata", "Q8888")), @@ -1372,4 +1375,68 @@ void kind_hostel_fromBasicCategory() { "pm:overture", null, 0 ))); } + + @Test + void withQrankViaWebsite_aerodrome_oakland() { + // Oakland International Airport: websites→iflyoak.com→Q1165584, QRank=140740 + // aerodrome grading: 140740 < 200000, 140740 >= 100000 → minZoom=11 → min_zoom=12 + var tags = new HashMap(); + tags.put("id", "f66024a2-99ed-40a1-8c01-a6c93a26b0e4"); + tags.put("theme", "places"); + tags.put("type", "place"); + tags.put("basic_category", "airport"); + tags.put("names.primary", "Oakland International Airport"); + tags.put("websites", new ArrayList<>(List.of("http://www.iflyoak.com/"))); + assertFeatures(11, + List.of(Map.of("kind", "aerodrome", "min_zoom", 12, "name", "Oakland International Airport")), + process(SimpleFeature.create(newPoint(1, 1), tags, "pm:overture", null, 0))); + } + + @Test + void withQrankViaWebsite_zoo_oakland() { + // Oakland Zoo: websites→oaklandzoo.org→Q2008530, QRank=12197 + // zoo grading: 12197 >= 10000 → minZoom=12 → min_zoom=13 + var tags = new HashMap(); + tags.put("id", "a74a40ae-92ae-4fae-958e-da2057fd1bc7"); + tags.put("theme", "places"); + tags.put("type", "place"); + tags.put("basic_category", "zoo"); + tags.put("names.primary", "Oakland Zoo"); + tags.put("websites", new ArrayList<>(List.of("http://www.oaklandzoo.org/"))); + assertFeatures(12, + List.of(Map.of("kind", "zoo", "min_zoom", 13, "name", "Oakland Zoo")), + process(SimpleFeature.create(newPoint(1, 1), tags, "pm:overture", null, 0))); + } + + @Test + void withQrankViaWebsite_college_ucb() { + // UC Berkeley: websites→berkeley.edu→Q168756, QRank=1604223 + // college 
grading: 1604223 < 2000000, 1604223 >= 500000 → minZoom=13 → min_zoom=14 + var tags = new HashMap(); + tags.put("id", "67e4f788-d72f-4a36-b6e2-5864f928bcb3"); + tags.put("theme", "places"); + tags.put("type", "place"); + tags.put("basic_category", "college_university"); + tags.put("names.primary", "UC Berkeley"); + tags.put("websites", new ArrayList<>(List.of("http://www.berkeley.edu"))); + assertFeatures(13, + List.of(Map.of("kind", "college", "min_zoom", 14, "name", "UC Berkeley")), + process(SimpleFeature.create(newPoint(1, 1), tags, "pm:overture", null, 0))); + } + + @Test + void withQrankViaWebsite_museum_omca() { + // Oakland Museum of California: websites→museumca.org→Q877714, QRank=9227 + // museum grading: 9227 < 20000, 9227 >= 5000 → minZoom=14 → min_zoom=15 + var tags = new HashMap(); + tags.put("id", "474b271e-afeb-49d2-b5c6-29478b555536"); + tags.put("theme", "places"); + tags.put("type", "place"); + tags.put("basic_category", "museum"); + tags.put("names.primary", "Oakland Museum of California"); + tags.put("websites", new ArrayList<>(List.of("http://museumca.org"))); + assertFeatures(14, + List.of(Map.of("kind", "museum", "min_zoom", 15, "name", "Oakland Museum of California")), + process(SimpleFeature.create(newPoint(1, 1), tags, "pm:overture", null, 0))); + } } diff --git a/tiles/src/test/resources/website_qid_fixture.csv.gz b/tiles/src/test/resources/website_qid_fixture.csv.gz new file mode 100644 index 0000000000000000000000000000000000000000..7da4b1ef22e56f3476033abacb99d0e2b32c6128 GIT binary patch literal 96 zcmV-m0H6OKiwFqWOS5SL|953#b7^#CUvX( Date: Thu, 12 Mar 2026 16:42:27 -0700 Subject: [PATCH 08/10] Reformat WIKIDATA.md table (Spotless) Spotless reformatted the markdown table during make lint; committed separately since it was missed from the previous commit. 
Co-Authored-By: Claude Sonnet 4.6 --- tiles/WIKIDATA.md | 296 +++++++++++++++++++++++----------------------- 1 file changed, 148 insertions(+), 148 deletions(-) diff --git a/tiles/WIKIDATA.md b/tiles/WIKIDATA.md index e3494e32..09de9d28 100644 --- a/tiles/WIKIDATA.md +++ b/tiles/WIKIDATA.md @@ -1,148 +1,148 @@ -# Wikidata ID Availability in Overture Data - -Investigation into whether Overture POIs can be linked to Wikidata for richer rendering. - -## Data sources examined - -`Oakland-visualtests.parquet` — Overture 2026-02-18 release, Bay Area coverage. - -## Findings by theme - -### `places` theme (618,880 features) - -- **Top-level `wikidata` field**: entirely null — 0 out of 618,880 features have a value. -- **`brand.wikidata`**: present for ~11,059 features — chain businesses where the *brand entity* has a Wikidata ID (e.g. `Q177054` for Burger King). Does not help with unique/famous places. -- **Unique/famous places** (Oakland Zoo, Oakland Museum of California, Oakland International Airport, UC Berkeley, etc.): no wikidata at all, neither top-level nor brand. - -### `divisions` theme (3,615 features) - -- **Top-level `wikidata` field**: populated for cities, counties, etc. (e.g. `Q62` for San Francisco, `Q927122` for South San Francisco). -- These feed the `places` map layer via `Places.java`, which already exports the `wikidata` attribute to output tiles (lines 324–325 and 431). - -## Alternative: website domain matching - -Overture places features often include `websites` and `socials` arrays. - -- **`socials`**: contains Facebook URLs with numeric page IDs (e.g. `facebook.com/353030440227`). Wikidata stores Facebook *usernames* (P2013), not numeric IDs — no direct join possible. -- **`websites`**: contains place-specific URLs. 
Extracting the root domain and matching against Wikidata P856 (official website) works: - - `oaklandzoo.org` → Q2008530 (Oakland Zoo) ✅ - - `museumca.org` → Q877714 (Oakland Museum of California) ✅ - - `oaklandairport.com` / `flyoakland.com` → Oakland International Airport ✅ - -### Scale - -- ~303,848 unique meaningful domains across all places features (excluding generic social/link domains). -- Too large to bulk-query the Wikidata SPARQL endpoint (60-second hard timeout). - -### Caveats - -- Multiple Overture features can share the same domain (e.g. all Oakland Public Library branches → `oaklandlibrary.org`), so the match links to the organization entity, not a specific location. -- Some domains map to multiple QIDs (e.g. `museumca.org` → Q877714, Q133252684, Q30672317) — needs disambiguation. -- No coverage for places without websites (~half of all features). - -## Bulk Wikidata P856 export via QLever - -The Wikidata SPARQL endpoint times out on full P856 scans. [QLever](https://qlever.dev/wikidata) (University of Freiburg) is a faster alternative engine that handles full-dataset scans. The complete P856 table (2.3M rows) was fetched in one query in ~16 seconds: - -``` -PREFIX wdt: <http://www.wikidata.org/prop/direct/> -SELECT ?item ?website WHERE { ?item wdt:P856 ?website } -``` - -```sh -curl -H "Accept: text/tab-separated-values" \ - --data-urlencode "query=PREFIX wdt: <http://www.wikidata.org/prop/direct/> SELECT ?item ?website WHERE { ?item wdt:P856 ?website }" \ - --data-urlencode "send=2400000" \ - "https://qlever.dev/api/wikidata" \ - -o data/sources/wikidata-p856.tsv -``` - -The TSV was then parsed to extract QID and root domain, and saved as `data/sources/wikidata-p856.parquet` (36 MB, 2.3M rows). 
- -## Match rate against Overture Oakland data - -Joining `wikidata-p856.parquet` against `Oakland-visualtests.parquet` on root domain (excluding generic domains like facebook.com, yelp.com, etc.): - -- **522,799** places have a usable website URL -- **121,685** (23.3%) matched to at least one Wikidata QID - -Confirmed matches for notable places: -- `oaklandzoo.org` → Q2008530 (Oakland Zoo) ✅ -- `museumca.org` → Q877714 (Oakland Museum of California) ✅ -- `iflyoak.com` → Q1165584 (Oakland International Airport) ✅ -- `berkeley.edu` → Q168756 (UC Berkeley) ✅ - -Top matched categories: doctor, park, government association, medical center, hotel, university, library, landmark. - -## Disambiguation: multiple QIDs per domain - -142,182 domains map to more than one QID in the P856 table (avg 6.7, max 81,822 for a digital library collection domain where every digitized item has its own Wikidata entry). - -### Approaches considered - -**Full URL path matching** — prefer the QID whose stored P856 URL most closely matches the full Overture URL, not just the domain. Ruled out: Wikidata typically stores bare root URLs (`http://www.oaklandzoo.org`), so this rarely breaks ties. - -**P31 instance-of type matching** — fetch P31 (instance of) for all candidate QIDs and prefer the one whose type aligns with the Overture category (e.g. Overture `museum` → prefer QID with `instance of: Q33506`). Ruled out: P31 has 122M rows in Wikidata; QLever serves at most ~15M rows per query and the full download fails. Batching 920k candidate QIDs via the SPARQL endpoint would be slow and fragile. - -**Lowest Q-number tiebreak** — prefer the QID with the smallest numeric value. This works because Wikidata assigns lower Q-numbers to older, more established entities. Exhibitions, digitized sub-items, and branch locations all post-date their parent organizations and receive higher Q-numbers. 
- -### Result - -The lowest Q-number heuristic gets the right answer in all tested cases: - -| Domain | Winner QID | Entity | -|---|---|---| -| `museumca.org` | Q877714 | Oakland Museum of California (not the exhibitions Q133252684, Q30672317) | -| `oaklandzoo.org` | Q2008530 | Oakland Zoo | -| `iflyoak.com` | Q1165584 | Oakland International Airport | -| `berkeley.edu` | Q168756 | UC Berkeley | -| `oaklandlibrary.org` | Q1090829 | Oakland Public Library (not individual branches) | -| `bart.gov` | Q250113 | Bay Area Rapid Transit | - -### Output - -`data/sources/wikidata-domain-qid.parquet` — 1,432,271 domain → QID mappings, 30 MB. Built by grouping `wikidata-p856.parquet` by domain and taking the minimum Q-number per domain. - -## Integration with the tile build pipeline - -### Distribution format - -The final file should be published as a dated gzipped two-column CSV, e.g.: - -``` -wikidata-website-qid-2026-03.csv.gz -domain,qid -oaklandzoo.org,Q2008530 -museumca.org,Q877714 -... -``` - -This mirrors the format and hosting pattern of `qrank.csv.gz` (from `qrank.toolforge.org`), but hosted on `r2-public.protomaps.com` since this is a derived file we generate ourselves. A dated URL (like `Overture-QRank-2025-12-17.parquet`) makes renders reproducible. It would be regenerated periodically by re-running the QLever P856 query and rebuilding. - -The file is generated by `generate-wikidata-website-qid.sh` at the repo root. The script fetches the full P856 table from QLever, extracts root domains, disambiguates via lowest Q-number, and writes the gzipped CSV — producing ~1.4M rows at ~15 MB in under 30 seconds. - -### Runtime lookup (two-hop via QRank) - -At render time the file is downloaded once into the sources directory if not present, then loaded into a `WebsiteQidDb` (a new class modeled on `QrankDb`) as a `HashMap` — domain string → numeric Q-ID. 
- -In `Pois.processOverture`, the website lookup acts as a fallback that fills in a wikidata ID when the feature doesn't have one natively, and then the existing QRank machinery takes over unchanged: - -```java -String wikidata = sf.getString("wikidata"); // always null for Overture places theme -if (wikidata == null) { - String website = /* first entry from sf.getList("websites") */; - wikidata = websiteQidDb.getQid(website); // domain → "Q2008530" -} -long qrank = (wikidata != null) ? qrankDb.get(wikidata) : 0; -var qrankedZoom = QrankDb.assignZoom(qrankGrading, kind, qrank); -``` - -The full lookup chain is: - -``` -sf.websites[0] → domain → Q-ID (WebsiteQidDb) - Q-ID → qrank score (QrankDb) - qrank score → minZoom (assignZoom) -``` - -No changes are needed to `QrankDb` or `assignZoom` — `QrankDb.get(long)` already accepts a numeric ID. A place only benefits if it has a matching website entry *and* a QRank score; otherwise `qrank = 0` and behavior is identical to today. +# Wikidata ID Availability in Overture Data + +Investigation into whether Overture POIs can be linked to Wikidata for richer rendering. + +## Data sources examined + +`Oakland-visualtests.parquet` — Overture 2026-02-18 release, Bay Area coverage. + +## Findings by theme + +### `places` theme (618,880 features) + +- **Top-level `wikidata` field**: entirely null — 0 out of 618,880 features have a value. +- **`brand.wikidata`**: present for ~11,059 features — chain businesses where the *brand entity* has a Wikidata ID (e.g. `Q177054` for Burger King). Does not help with unique/famous places. +- **Unique/famous places** (Oakland Zoo, Oakland Museum of California, Oakland International Airport, UC Berkeley, etc.): no wikidata at all, neither top-level nor brand. + +### `divisions` theme (3,615 features) + +- **Top-level `wikidata` field**: populated for cities, counties, etc. (e.g. `Q62` for San Francisco, `Q927122` for South San Francisco). 
+- These feed the `places` map layer via `Places.java`, which already exports the `wikidata` attribute to output tiles (lines 324–325 and 431). + +## Alternative: website domain matching + +Overture places features often include `websites` and `socials` arrays. + +- **`socials`**: contains Facebook URLs with numeric page IDs (e.g. `facebook.com/353030440227`). Wikidata stores Facebook *usernames* (P2013), not numeric IDs — no direct join possible. +- **`websites`**: contains place-specific URLs. Extracting the root domain and matching against Wikidata P856 (official website) works: + - `oaklandzoo.org` → Q2008530 (Oakland Zoo) ✅ + - `museumca.org` → Q877714 (Oakland Museum of California) ✅ + - `oaklandairport.com` / `flyoakland.com` → Oakland International Airport ✅ + +### Scale + +- ~303,848 unique meaningful domains across all places features (excluding generic social/link domains). +- Too large to bulk-query the Wikidata SPARQL endpoint (60-second hard timeout). + +### Caveats + +- Multiple Overture features can share the same domain (e.g. all Oakland Public Library branches → `oaklandlibrary.org`), so the match links to the organization entity, not a specific location. +- Some domains map to multiple QIDs (e.g. `museumca.org` → Q877714, Q133252684, Q30672317) — needs disambiguation. +- No coverage for places without websites (~half of all features). + +## Bulk Wikidata P856 export via QLever + +The Wikidata SPARQL endpoint times out on full P856 scans. [QLever](https://qlever.dev/wikidata) (University of Freiburg) is a faster alternative engine that handles full-dataset scans. 
The complete P856 table (2.3M rows) was fetched in one query in ~16 seconds: + +``` +PREFIX wdt: <http://www.wikidata.org/prop/direct/> +SELECT ?item ?website WHERE { ?item wdt:P856 ?website } +``` + +```sh +curl -H "Accept: text/tab-separated-values" \ + --data-urlencode "query=PREFIX wdt: <http://www.wikidata.org/prop/direct/> SELECT ?item ?website WHERE { ?item wdt:P856 ?website }" \ + --data-urlencode "send=2400000" \ + "https://qlever.dev/api/wikidata" \ + -o data/sources/wikidata-p856.tsv +``` + +The TSV was then parsed to extract QID and root domain, and saved as `data/sources/wikidata-p856.parquet` (36 MB, 2.3M rows). + +## Match rate against Overture Oakland data + +Joining `wikidata-p856.parquet` against `Oakland-visualtests.parquet` on root domain (excluding generic domains like facebook.com, yelp.com, etc.): + +- **522,799** places have a usable website URL +- **121,685** (23.3%) matched to at least one Wikidata QID + +Confirmed matches for notable places: +- `oaklandzoo.org` → Q2008530 (Oakland Zoo) ✅ +- `museumca.org` → Q877714 (Oakland Museum of California) ✅ +- `iflyoak.com` → Q1165584 (Oakland International Airport) ✅ +- `berkeley.edu` → Q168756 (UC Berkeley) ✅ + +Top matched categories: doctor, park, government association, medical center, hotel, university, library, landmark. + +## Disambiguation: multiple QIDs per domain + +142,182 domains map to more than one QID in the P856 table (avg 6.7, max 81,822 for a digital library collection domain where every digitized item has its own Wikidata entry). + +### Approaches considered + +**Full URL path matching** — prefer the QID whose stored P856 URL most closely matches the full Overture URL, not just the domain. Ruled out: Wikidata typically stores bare root URLs (`http://www.oaklandzoo.org`), so this rarely breaks ties. + +**P31 instance-of type matching** — fetch P31 (instance of) for all candidate QIDs and prefer the one whose type aligns with the Overture category (e.g. Overture `museum` → prefer QID with `instance of: Q33506`). 
Ruled out: P31 has 122M rows in Wikidata; QLever serves at most ~15M rows per query and the full download fails. Batching 920k candidate QIDs via the SPARQL endpoint would be slow and fragile. + +**Lowest Q-number tiebreak** — prefer the QID with the smallest numeric value. This works because Wikidata assigns lower Q-numbers to older, more established entities. Exhibitions, digitized sub-items, and branch locations all post-date their parent organizations and receive higher Q-numbers. + +### Result + +The lowest Q-number heuristic gets the right answer in all tested cases: + +| Domain | Winner QID | Entity | +|----------------------|------------|--------------------------------------------------------------------------| +| `museumca.org` | Q877714 | Oakland Museum of California (not the exhibitions Q133252684, Q30672317) | +| `oaklandzoo.org` | Q2008530 | Oakland Zoo | +| `iflyoak.com` | Q1165584 | Oakland International Airport | +| `berkeley.edu` | Q168756 | UC Berkeley | +| `oaklandlibrary.org` | Q1090829 | Oakland Public Library (not individual branches) | +| `bart.gov` | Q250113 | Bay Area Rapid Transit | + +### Output + +`data/sources/wikidata-domain-qid.parquet` — 1,432,271 domain → QID mappings, 30 MB. Built by grouping `wikidata-p856.parquet` by domain and taking the minimum Q-number per domain. + +## Integration with the tile build pipeline + +### Distribution format + +The final file should be published as a dated gzipped two-column CSV, e.g.: + +``` +wikidata-website-qid-2026-03.csv.gz +domain,qid +oaklandzoo.org,Q2008530 +museumca.org,Q877714 +... +``` + +This mirrors the format and hosting pattern of `qrank.csv.gz` (from `qrank.toolforge.org`), but hosted on `r2-public.protomaps.com` since this is a derived file we generate ourselves. A dated URL (like `Overture-QRank-2025-12-17.parquet`) makes renders reproducible. It would be regenerated periodically by re-running the QLever P856 query and rebuilding. 
+ +The file is generated by `generate-wikidata-website-qid.sh` at the repo root. The script fetches the full P856 table from QLever, extracts root domains, disambiguates via lowest Q-number, and writes the gzipped CSV — producing ~1.4M rows at ~15 MB in under 30 seconds. + +### Runtime lookup (two-hop via QRank) + +At render time the file is downloaded once into the sources directory if not present, then loaded into a `WebsiteQidDb` (a new class modeled on `QrankDb`) as a `HashMap` — domain string → numeric Q-ID. + +In `Pois.processOverture`, the website lookup acts as a fallback that fills in a wikidata ID when the feature doesn't have one natively, and then the existing QRank machinery takes over unchanged: + +```java +String wikidata = sf.getString("wikidata"); // always null for Overture places theme +if (wikidata == null) { + String website = /* first entry from sf.getList("websites") */; + wikidata = websiteQidDb.getQid(website); // domain → "Q2008530" +} +long qrank = (wikidata != null) ? qrankDb.get(wikidata) : 0; +var qrankedZoom = QrankDb.assignZoom(qrankGrading, kind, qrank); +``` + +The full lookup chain is: + +``` +sf.websites[0] → domain → Q-ID (WebsiteQidDb) + Q-ID → qrank score (QrankDb) + qrank score → minZoom (assignZoom) +``` + +No changes are needed to `QrankDb` or `assignZoom` — `QrankDb.get(long)` already accepts a numeric ID. A place only benefits if it has a matching website entry *and* a QRank score; otherwise `qrank = 0` and behavior is identical to today. From 7e9acc7f7604df87d7c821e8b4cd27574cebc032 Mon Sep 17 00:00:00 2001 From: Michal Migurski Date: Thu, 12 Mar 2026 17:02:46 -0700 Subject: [PATCH 09/10] =?UTF-8?q?Restrict=20website=E2=86=92QID=20lookup?= =?UTF-8?q?=20to=20institution=20categories=20and=20high=20confidence?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two guards prevent brand websites from inflating POI zoom levels: 1. 
Category allowlist: only apply website→QID when basic_category is an institution-level feature (airport, zoo, museum, college_university, etc.). Excludes air_transport_facility_service, travel_service, transportation_location, etc., where the website resolves to a brand entity (e.g. jetblue.com → Q161086 JetBlue Airways) rather than the specific place. 2. Confidence threshold (0.9): low-confidence features are often brand counters or services miscategorised as the institution. Real airports, zoos, etc. cluster at 0.90+; junk like JetBlue-as-airport appears at 0.32. Tests: websiteQid_ineligibleCategory_noEarlyZoom (category guard) and websiteQid_lowConfidence_noEarlyZoom (confidence guard), both using real Overture UUIDs e67dea74 / 8b6a937e for JetBlue features at OAK. Prompt: "Do option B [...] Comment about why they are eligible in the code [...] and test [...] I still see JetBlue appearing at z12 or even z11, why? [...] good yes and test" Co-Authored-By: Claude Sonnet 4.6 --- .../com/protomaps/basemap/layers/Pois.java | 39 +++++++++++++++++- .../protomaps/basemap/layers/LayerTest.java | 7 ++-- .../protomaps/basemap/layers/PoisTest.java | 41 +++++++++++++++++++ 3 files changed, 82 insertions(+), 5 deletions(-) diff --git a/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java b/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java index 529087f5..0bd18dad 100644 --- a/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java +++ b/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java @@ -548,6 +548,37 @@ public void processOsm(SourceFeature sf, FeatureCollector features) { outputFeature.setPointLabelGridSizeAndLimit(14, 8, 1); } + // Categories where the Overture feature IS the institution itself, so its website + // reliably resolves to that institution's Wikidata QID rather than a corporate brand. + // Excluded: air_transport_facility_service, transportation_location, travel_service, etc. 
+ // — these are sub-facilities or branded counters whose websites resolve to the airline/ + // brand entity (e.g. jetblue.com → Q161086 JetBlue Airways), producing spuriously high + // QRank scores that push check-in counters and baggage claims to early zoom levels. + private static final java.util.Set WEBSITE_QID_ELIGIBLE_CATEGORIES = java.util.Set.of( + "airport", // the airport itself, not airline counters inside it + "zoo", // institution-level feature + "museum", // institution-level feature + "art_museum", // institution-level feature + "college_university", // institution-level feature + "university", // institution-level feature + "park", // institution-level feature + "national_park", // institution-level feature + "aquarium", // institution-level feature + "botanical_garden", // institution-level feature + "stadium", // institution-level feature + "library" // institution-level feature + ); + + // Minimum confidence for website→QID lookup. Low-confidence features are often + // brand counters or services miscategorised as the institution (e.g. JetBlue at 0.32 + // tagged basic_category=airport). Real airports/zoos/etc. cluster at 0.90+. + private static final double WEBSITE_QID_MIN_CONFIDENCE = 0.9; + + private static boolean isWebsiteQidEligible(String basicCategory, double confidence) { + return basicCategory != null && WEBSITE_QID_ELIGIBLE_CATEGORIES.contains(basicCategory) && + confidence >= WEBSITE_QID_MIN_CONFIDENCE; + } + public void processOverture(SourceFeature sf, FeatureCollector features) { // Filter by type field - Overture transportation theme if (!"places".equals(sf.getString("theme"))) { @@ -568,9 +599,13 @@ public void processOverture(SourceFeature sf, FeatureCollector features) { if (kind.equals("pm:undefined")) return; - // QRank may override minZoom entirely + // QRank may override minZoom entirely. + // Website→QID lookup is restricted to categories where the feature IS the institution + // (airport, zoo, museum, etc.) 
— not sub-facilities of branded services where the + // website resolves to a corporate brand entity rather than the specific place. String wikidata = sf.getString("wikidata"); - if (wikidata == null && websiteQidDb != null) { + double confidence = sf.getTag("confidence")instanceof Number n ? n.doubleValue() : 0.0; + if (wikidata == null && websiteQidDb != null && isWebsiteQidEligible(sf.getString("basic_category"), confidence)) { Object websitesObj = sf.getTag("websites"); if (websitesObj instanceof List websites && !((List) websites).isEmpty()) { wikidata = websiteQidDb.getQid(websites.get(0).toString()); diff --git a/tiles/src/test/java/com/protomaps/basemap/layers/LayerTest.java b/tiles/src/test/java/com/protomaps/basemap/layers/LayerTest.java index dbe5cc7c..04a2fa09 100644 --- a/tiles/src/test/java/com/protomaps/basemap/layers/LayerTest.java +++ b/tiles/src/test/java/com/protomaps/basemap/layers/LayerTest.java @@ -28,15 +28,16 @@ abstract class LayerTest { final QrankDb qrankDb = new QrankDb(LongLongHashMap.from( - new long[]{8888, 1165584, 2008530, 168756, 877714}, - new long[]{100000, 140740, 12197, 1604223, 9227} + new long[]{8888, 1165584, 2008530, 168756, 877714, 161086}, + new long[]{100000, 140740, 12197, 1604223, 9227, 5000000} )); final WebsiteQidDb websiteQidDb = new WebsiteQidDb(Map.of( "iflyoak.com", 1165584L, // Oakland Airport Q1165584 "oaklandzoo.org", 2008530L, // Oakland Zoo Q2008530 "berkeley.edu", 168756L, // UC Berkeley Q168756 - "museumca.org", 877714L // OMCA Q877714 + "museumca.org", 877714L, // OMCA Q877714 + "jetblue.com", 161086L // JetBlue Airways Q161086 (airline brand, not a place) )); final Basemap profile = new Basemap(qrankDb, websiteQidDb, countryCoder, null, ""); diff --git a/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java b/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java index 37f82a36..6003feff 100644 --- a/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java +++ 
b/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java @@ -1376,6 +1376,43 @@ void kind_hostel_fromBasicCategory() { ))); } + @Test + void websiteQid_ineligibleCategory_noEarlyZoom() { + // JetBlue counter at Oakland Airport: basic_category=air_transport_facility_service + // jetblue.com → Q161086 (JetBlue Airways, QRank=5M) — but the category is ineligible + // for website→QID lookup, so the high brand QRank must NOT promote it to an early zoom. + // Before the eligibility allowlist this would have resolved to zoom 10. + var tags = new HashMap(); + tags.put("id", "e67dea74-eb8c-47e8-bfd3-80af26dd7d5c"); + tags.put("theme", "places"); + tags.put("type", "place"); + tags.put("basic_category", "air_transport_facility_service"); + tags.put("confidence", 0.64); + tags.put("names.primary", "JetBlue Airways"); + tags.put("websites", new ArrayList<>(List.of("http://www.jetblue.com"))); + assertFeatures(15, + List.of(Map.of("kind", "air_transport_facility_service", "min_zoom", 16, "name", "JetBlue Airways")), + process(SimpleFeature.create(newPoint(1, 1), tags, "pm:overture", null, 0))); + } + + @Test + void websiteQid_lowConfidence_noEarlyZoom() { + // JetBlue miscategorized as basic_category=airport at confidence=0.32 — the eligible + // category passes the allowlist, but low confidence must block the website→QID lookup. + // jetblue.com → Q161086 (QRank=5M) would otherwise promote this to zoom 10. 
+ var tags = new HashMap(); + tags.put("id", "8b6a937e-c32d-436b-b5cc-397fb8f978f2"); + tags.put("theme", "places"); + tags.put("type", "place"); + tags.put("basic_category", "airport"); + tags.put("confidence", 0.32); + tags.put("names.primary", "JetBlue"); + tags.put("websites", new ArrayList<>(List.of("http://www.jetblue.com"))); + assertFeatures(15, + List.of(Map.of("kind", "aerodrome", "min_zoom", 14, "name", "JetBlue")), + process(SimpleFeature.create(newPoint(1, 1), tags, "pm:overture", null, 0))); + } + @Test void withQrankViaWebsite_aerodrome_oakland() { // Oakland International Airport: websites→iflyoak.com→Q1165584, QRank=140740 @@ -1385,6 +1422,7 @@ void withQrankViaWebsite_aerodrome_oakland() { tags.put("theme", "places"); tags.put("type", "place"); tags.put("basic_category", "airport"); + tags.put("confidence", 0.9973583119336149); tags.put("names.primary", "Oakland International Airport"); tags.put("websites", new ArrayList<>(List.of("http://www.iflyoak.com/"))); assertFeatures(11, @@ -1401,6 +1439,7 @@ void withQrankViaWebsite_zoo_oakland() { tags.put("theme", "places"); tags.put("type", "place"); tags.put("basic_category", "zoo"); + tags.put("confidence", 0.9735918045043945); tags.put("names.primary", "Oakland Zoo"); tags.put("websites", new ArrayList<>(List.of("http://www.oaklandzoo.org/"))); assertFeatures(12, @@ -1417,6 +1456,7 @@ void withQrankViaWebsite_college_ucb() { tags.put("theme", "places"); tags.put("type", "place"); tags.put("basic_category", "college_university"); + tags.put("confidence", 0.9735918045043945); tags.put("names.primary", "UC Berkeley"); tags.put("websites", new ArrayList<>(List.of("http://www.berkeley.edu"))); assertFeatures(13, @@ -1433,6 +1473,7 @@ void withQrankViaWebsite_museum_omca() { tags.put("theme", "places"); tags.put("type", "place"); tags.put("basic_category", "museum"); + tags.put("confidence", 0.9500626074407351); tags.put("names.primary", "Oakland Museum of California"); tags.put("websites", new 
ArrayList<>(List.of("http://museumca.org"))); assertFeatures(14, From 9bdfcfe93542f9a61ced9f90f54fee2e3bdb026e Mon Sep 17 00:00:00 2001 From: Michal Migurski Date: Thu, 12 Mar 2026 17:12:52 -0700 Subject: [PATCH 10/10] Use Overture confidence for POI filtering and rendering priority Drop features below confidence 0.65 (junk tier: ~127k features dominated by real estate listings, beauty salons, ATMs from uncertain sources). Within the remaining features, use confidence to break sort key ties so higher-confidence POIs win label collision resolution at the same zoom. Sort key: minZoom * 1000 - (int)(confidence * 100), so confidence=0.99 subtracts 99 and confidence=0.65 subtracts 65; within the same minZoom the higher-confidence POI gets the lower key (higher priority). Tests updated: websiteQid_ineligibleCategory_dropped and websiteQid_lowConfidence_dropped now correctly expect zero features. kind_nationalPark_fromBasicCategory switched to Pinnacles National Park (4d619bc0, confidence=0.917) since the previous Alcatraz fixture (814b8a78, confidence=0.639) falls below the new cutoff. Prompt: "Let's bring more Overture confidence into POI rendering: make higher-confidence POIs higher rendering priority, and simply omit ones below 0.65 (junk tier)" Co-Authored-By: Claude Sonnet 4.6 --- .../com/protomaps/basemap/layers/Pois.java | 15 ++++++++-- .../protomaps/basemap/layers/PoisTest.java | 30 +++++++++---------- 2 files changed, 28 insertions(+), 17 deletions(-) diff --git a/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java b/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java index 0bd18dad..024687e0 100644 --- a/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java +++ b/tiles/src/main/java/com/protomaps/basemap/layers/Pois.java @@ -599,12 +599,18 @@ public void processOverture(SourceFeature sf, FeatureCollector features) { if (kind.equals("pm:undefined")) return; + // Drop low-confidence features. 
Below 0.65, features are dominated by uncertain data: + // real estate listings, auto repair, beauty salons, ATMs from low-quality sources. + double confidence = sf.getTag("confidence")instanceof Number n ? n.doubleValue() : 0.0; + if (confidence < 0.65) { + return; + } + // QRank may override minZoom entirely. // Website→QID lookup is restricted to categories where the feature IS the institution // (airport, zoo, museum, etc.) — not sub-facilities of branded services where the // website resolves to a corporate brand entity rather than the specific place. String wikidata = sf.getString("wikidata"); - double confidence = sf.getTag("confidence")instanceof Number n ? n.doubleValue() : 0.0; if (wikidata == null && websiteQidDb != null && isWebsiteQidEligible(sf.getString("basic_category"), confidence)) { Object websitesObj = sf.getTag("websites"); if (websitesObj instanceof List websites && !((List) websites).isEmpty()) { @@ -628,6 +634,10 @@ public void processOverture(SourceFeature sf, FeatureCollector features) { String name = sf.getString("names.primary"); + // Sort key: lower = higher rendering priority. Within the same minZoom bucket, + // higher confidence wins (subtract confidence*100 so 0.99 → -99, 0.65 → -65). 
+ int sortKey = minZoom * 1000 - (int) (confidence * 100); + features.point(this.name()) // all POIs should receive their IDs at all zooms // (there is no merging of POIs like with lines and polygons in other layers) @@ -640,7 +650,8 @@ public void processOverture(SourceFeature sf, FeatureCollector features) { .setAttr("min_zoom", minZoom + 1) // .setBufferPixels(8) - .setZoomRange(Math.min(minZoom, 15), 15); + .setZoomRange(Math.min(minZoom, 15), 15) + .setSortKey(sortKey); } @Override diff --git a/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java b/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java index 6003feff..0a6b1b26 100644 --- a/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java +++ b/tiles/src/test/java/com/protomaps/basemap/layers/PoisTest.java @@ -1143,16 +1143,16 @@ class PoisOvertureTest extends LayerTest { @Test void kind_nationalPark_fromBasicCategory() { assertFeatures(15, - List.of(Map.of("kind", "national_park", "min_zoom", 12, "name", "Alcatraz National Park")), + List.of(Map.of("kind", "national_park", "min_zoom", 12, "name", "Pinnacles National Park")), process(SimpleFeature.create( newPoint(1, 1), new HashMap<>(Map.of( - "id", "814b8a78-161f-4273-a4bb-7d686d0e3be4", // https://www.openstreetmap.org/way/295140461/history/15 + "id", "4d619bc0-d30c-4dbe-9f8b-079cf06c1a39", "theme", "places", "type", "place", "basic_category", "national_park", - "names.primary", "Alcatraz National Park", - "confidence", 0.64 + "names.primary", "Pinnacles National Park", + "confidence", 0.917024286724829 )), "pm:overture", null, 0 ))); @@ -1377,11 +1377,10 @@ void kind_hostel_fromBasicCategory() { } @Test - void websiteQid_ineligibleCategory_noEarlyZoom() { - // JetBlue counter at Oakland Airport: basic_category=air_transport_facility_service - // jetblue.com → Q161086 (JetBlue Airways, QRank=5M) — but the category is ineligible - // for website→QID lookup, so the high brand QRank must NOT promote it to an early zoom. 
- // Before the eligibility allowlist this would have resolved to zoom 10. + void websiteQid_ineligibleCategory_dropped() { + // JetBlue counter at Oakland Airport: basic_category=air_transport_facility_service, + // confidence=0.64 (below 0.65 cutoff) — dropped entirely before any kind or QID lookup. + // Before the confidence cutoff this would have appeared at min_zoom=16. var tags = new HashMap(); tags.put("id", "e67dea74-eb8c-47e8-bfd3-80af26dd7d5c"); tags.put("theme", "places"); @@ -1391,15 +1390,16 @@ void websiteQid_ineligibleCategory_noEarlyZoom() { tags.put("names.primary", "JetBlue Airways"); tags.put("websites", new ArrayList<>(List.of("http://www.jetblue.com"))); assertFeatures(15, - List.of(Map.of("kind", "air_transport_facility_service", "min_zoom", 16, "name", "JetBlue Airways")), + List.of(), process(SimpleFeature.create(newPoint(1, 1), tags, "pm:overture", null, 0))); } @Test - void websiteQid_lowConfidence_noEarlyZoom() { - // JetBlue miscategorized as basic_category=airport at confidence=0.32 — the eligible - // category passes the allowlist, but low confidence must block the website→QID lookup. - // jetblue.com → Q161086 (QRank=5M) would otherwise promote this to zoom 10. + void websiteQid_lowConfidence_dropped() { + // JetBlue miscategorized as basic_category=airport at confidence=0.32 — below the 0.65 + // cutoff, so the feature is dropped entirely before any website→QID lookup can fire. + // Before the confidence cutoff, jetblue.com → Q161086 (QRank=5M) would have promoted + // this to zoom 10. 
var tags = new HashMap(); tags.put("id", "8b6a937e-c32d-436b-b5cc-397fb8f978f2"); tags.put("theme", "places"); @@ -1409,7 +1409,7 @@ void websiteQid_lowConfidence_noEarlyZoom() { tags.put("names.primary", "JetBlue"); tags.put("websites", new ArrayList<>(List.of("http://www.jetblue.com"))); assertFeatures(15, - List.of(Map.of("kind", "aerodrome", "min_zoom", 14, "name", "JetBlue")), + List.of(), process(SimpleFeature.create(newPoint(1, 1), tags, "pm:overture", null, 0))); }
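
Reviewer note: the confidence cutoff and sort key from this patch can be tried in isolation with a small stand-alone sketch. It is not the code in `Pois.java`: the class `ConfidenceSortKey`, its nested `Poi` type, and the helper names are hypothetical. Only the 0.65 threshold and the formula `minZoom * 1000 - (int) (confidence * 100)` are taken from the patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone sketch; these names do not exist in Pois.java.
final class ConfidenceSortKey {
  // Cutoff taken from the patch: features below this confidence are dropped.
  static final double MIN_CONFIDENCE = 0.65;

  // Minimal illustrative stand-in for a POI feature.
  static final class Poi {
    final String name;
    final int minZoom;
    final double confidence;

    Poi(String name, int minZoom, double confidence) {
      this.name = name;
      this.minZoom = minZoom;
      this.confidence = confidence;
    }
  }

  static boolean isKept(double confidence) {
    return confidence >= MIN_CONFIDENCE;
  }

  // Formula from the patch; a lower key means higher rendering priority.
  static int sortKey(int minZoom, double confidence) {
    return minZoom * 1000 - (int) (confidence * 100);
  }

  // Drops low-confidence POIs, then returns the remaining names by ascending sort key.
  static List<String> renderOrder(List<Poi> pois) {
    List<Poi> kept = new ArrayList<>();
    for (Poi p : pois) {
      if (isKept(p.confidence)) {
        kept.add(p);
      }
    }
    kept.sort(Comparator.comparingInt((Poi p) -> sortKey(p.minZoom, p.confidence)));
    List<String> names = new ArrayList<>();
    for (Poi p : kept) {
      names.add(p.name);
    }
    return names;
  }

  public static void main(String[] args) {
    List<Poi> pois = List.of(
      new Poi("A", 12, 0.65),
      new Poi("B", 12, 0.99),
      new Poi("C", 12, 0.32), // below the cutoff, dropped
      new Poi("D", 11, 0.75));
    System.out.println(renderOrder(pois)); // prints [D, B, A]
  }
}
```

Because the zoom term is scaled by 1000 and the confidence term never exceeds 100, confidence only reorders features that share a `minZoom`. The `(int)` cast truncates, so two confidences that differ by less than 0.01 can still produce the same key.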