148 changes: 148 additions & 0 deletions tiles/WIKIDATA.md
@@ -0,0 +1,148 @@
# Wikidata ID Availability in Overture Data

Investigation into whether Overture POIs can be linked to Wikidata for richer rendering.

## Data sources examined

`Oakland-visualtests.parquet` — Overture 2026-02-18 release, Bay Area coverage.

## Findings by theme

### `places` theme (618,880 features)

- **Top-level `wikidata` field**: entirely null — 0 out of 618,880 features have a value.
- **`brand.wikidata`**: present for ~11,059 features — chain businesses where the *brand entity* has a Wikidata ID (e.g. `Q177054` for Burger King). Does not help with unique/famous places.
- **Unique/famous places** (Oakland Zoo, Oakland Museum of California, Oakland International Airport, UC Berkeley, etc.): no wikidata at all, neither top-level nor brand.

### `divisions` theme (3,615 features)

- **Top-level `wikidata` field**: populated for cities, counties, etc. (e.g. `Q62` for San Francisco, `Q927122` for South San Francisco).
- These feed the `places` map layer via `Places.java`, which already exports the `wikidata` attribute to output tiles (lines 324–325 and 431).

## Alternative: website domain matching

Overture places features often include `websites` and `socials` arrays.

- **`socials`**: contains Facebook URLs with numeric page IDs (e.g. `facebook.com/353030440227`). Wikidata stores Facebook *usernames* (P2013), not numeric IDs — no direct join possible.
- **`websites`**: contains place-specific URLs. Extracting the root domain and matching against Wikidata P856 (official website) works:
  - `oaklandzoo.org` → Q2008530 (Oakland Zoo) ✅
  - `museumca.org` → Q877714 (Oakland Museum of California) ✅
  - `oaklandairport.com` / `flyoakland.com` → Oakland International Airport ✅

### Scale

- ~303,848 unique meaningful domains across all places features (excluding generic social/link domains).
- Too large to bulk-query the Wikidata SPARQL endpoint (60-second hard timeout).

### Caveats

- Multiple Overture features can share the same domain (e.g. all Oakland Public Library branches → `oaklandlibrary.org`), so the match links to the organization entity, not a specific location.
- Some domains map to multiple QIDs (e.g. `museumca.org` → Q877714, Q133252684, Q30672317) — needs disambiguation.
- No coverage for places without a usable website URL (96,081 of 618,880 features, about 16%, per the counts in the match-rate section).

## Bulk Wikidata P856 export via QLever

The Wikidata SPARQL endpoint times out on full P856 scans. [QLever](https://qlever.dev/wikidata) (University of Freiburg) is a faster alternative engine that handles full-dataset scans. The complete P856 table (2.3M rows) was fetched in one query in ~16 seconds:

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?website WHERE { ?item wdt:P856 ?website }
```

```sh
curl -H "Accept: text/tab-separated-values" \
  --data-urlencode "query=PREFIX wdt: <http://www.wikidata.org/prop/direct/> SELECT ?item ?website WHERE { ?item wdt:P856 ?website }" \
  --data-urlencode "send=2400000" \
  "https://qlever.dev/api/wikidata" \
  -o data/sources/wikidata-p856.tsv
```

The TSV was then parsed to extract QID and root domain, and saved as `data/sources/wikidata-p856.parquet` (36 MB, 2.3M rows).
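The parsing step can be sketched in Java as follows. This is a minimal sketch, assuming each TSV row holds two angle-bracketed IRIs (an entity IRI, a tab, then a website URL), which is what the regular expressions in `generate-wikidata-website-qid.sh` imply. The class and method names are hypothetical and are not part of the pipeline.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: mirrors the two regexes used by the generator script.
final class P856RowSketch {
  private static final Pattern QID = Pattern.compile("entity/(Q[0-9]+)");
  private static final Pattern DOMAIN = Pattern.compile("https?://(?:www\\.)?([^/>?]+)");

  /** Parses one TSV row into {qid, domain}; returns null when either part is missing. */
  static String[] parseRow(String row) {
    String[] cols = row.split("\t", 2);
    if (cols.length < 2) {
      return null;
    }
    Matcher qid = QID.matcher(cols[0]);
    // Lowercase before matching so "WWW.Example.org" and "www.example.org" agree.
    Matcher domain = DOMAIN.matcher(cols[1].toLowerCase(Locale.ROOT));
    if (!qid.find() || !domain.find()) {
      return null;
    }
    return new String[] {qid.group(1), domain.group(1)};
  }

  public static void main(String[] args) {
    String[] parsed = parseRow("<http://www.wikidata.org/entity/Q2008530>\t<http://www.oaklandzoo.org>");
    System.out.println(parsed[0] + " " + parsed[1]);
  }
}
```

Note that this keeps subdomains intact: only a leading `www.` is removed, so the "root domain" here is the hostname, not the registrable domain.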

## Match rate against Overture Oakland data

Joining `wikidata-p856.parquet` against `Oakland-visualtests.parquet` on root domain (excluding generic domains like facebook.com, yelp.com, etc.):

- **522,799** places have a usable website URL
- **121,685** (23.3%) matched to at least one Wikidata QID

Confirmed matches for notable places:
- `oaklandzoo.org` → Q2008530 (Oakland Zoo) ✅
- `museumca.org` → Q877714 (Oakland Museum of California) ✅
- `iflyoak.com` → Q1165584 (Oakland International Airport) ✅
- `berkeley.edu` → Q168756 (UC Berkeley) ✅

Top matched categories: doctor, park, government association, medical center, hotel, university, library, landmark.
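
The match-rate calculation above can be sketched as below. This is an illustration only: the `GENERIC` set lists just the two generic domains named in this section (the full exclusion list is not given here), and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the match-rate computation; not the query actually used.
final class MatchRateSketch {
  // Assumption: only the two generic domains named in the text; the real list is longer.
  static final Set<String> GENERIC = Set.of("facebook.com", "yelp.com");

  /** Fraction of places with a usable (non-generic) domain that also appear in the Wikidata domain set. */
  static double matchRate(List<String> placeDomains, Set<String> wikidataDomains) {
    long usable = 0;
    long matched = 0;
    for (String domain : placeDomains) {
      if (domain == null || GENERIC.contains(domain)) {
        continue; // not a usable website for matching
      }
      usable++;
      if (wikidataDomains.contains(domain)) {
        matched++;
      }
    }
    return usable == 0 ? 0.0 : (double) matched / usable;
  }

  public static void main(String[] args) {
    // The figures reported above: 121,685 matched out of 522,799 usable.
    System.out.printf("%.1f%%%n", 100.0 * 121_685 / 522_799);
  }
}
```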

## Disambiguation: multiple QIDs per domain

142,182 domains map to more than one QID in the P856 table (avg 6.7, max 81,822 for a digital library collection domain where every digitized item has its own Wikidata entry).

### Approaches considered

**Full URL path matching** — prefer the QID whose stored P856 URL most closely matches the full Overture URL, not just the domain. Ruled out: Wikidata typically stores bare root URLs (`http://www.oaklandzoo.org`), so this rarely breaks ties.

**P31 instance-of type matching** — fetch P31 (instance of) for all candidate QIDs and prefer the one whose type aligns with the Overture category (e.g. Overture `museum` → prefer QID with `instance of: Q33506`). Ruled out: P31 has 122M rows in Wikidata; QLever serves at most ~15M rows per query and the full download fails. Batching 920k candidate QIDs via the SPARQL endpoint would be slow and fragile.

**Lowest Q-number tiebreak**: prefer the QID with the smallest numeric value. Wikidata assigns Q-numbers sequentially as items are created, so a lower number means an older item. Exhibitions, digitized sub-items, and branch locations are typically created after their parent organizations and therefore receive higher Q-numbers. This is a heuristic, not a guarantee.
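
A minimal Java sketch of the tiebreak, using the `museumca.org` candidates listed earlier in this document. The class and method names are hypothetical; the actual build does this grouping in DuckDB rather than Java.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the lowest-Q-number tiebreak.
final class LowestQidSketch {
  /** For each domain, keeps the smallest numeric Q-number among its candidates. */
  static Map<String, Long> lowestQidPerDomain(List<String[]> rows) {
    Map<String, Long> winners = new HashMap<>();
    for (String[] row : rows) {
      String domain = row[0];
      long qNumber = Long.parseLong(row[1].substring(1)); // "Q877714" -> 877714
      winners.merge(domain, qNumber, Long::min);
    }
    return winners;
  }

  public static void main(String[] args) {
    List<String[]> rows = List.of(
      new String[] {"museumca.org", "Q133252684"},
      new String[] {"museumca.org", "Q877714"},
      new String[] {"museumca.org", "Q30672317"},
      new String[] {"oaklandzoo.org", "Q2008530"});
    System.out.println("Q" + lowestQidPerDomain(rows).get("museumca.org"));
  }
}
```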

### Result

The lowest Q-number heuristic gets the right answer in all tested cases:

| Domain | Winner QID | Entity |
|----------------------|------------|--------------------------------------------------------------------------|
| `museumca.org` | Q877714 | Oakland Museum of California (not the exhibitions Q133252684, Q30672317) |
| `oaklandzoo.org` | Q2008530 | Oakland Zoo |
| `iflyoak.com` | Q1165584 | Oakland International Airport |
| `berkeley.edu` | Q168756 | UC Berkeley |
| `oaklandlibrary.org` | Q1090829 | Oakland Public Library (not individual branches) |
| `bart.gov` | Q250113 | Bay Area Rapid Transit |

### Output

`data/sources/wikidata-domain-qid.parquet` — 1,432,271 domain → QID mappings, 30 MB. Built by grouping `wikidata-p856.parquet` by domain and taking the minimum Q-number per domain.

## Integration with the tile build pipeline

### Distribution format

The final file should be published as a dated gzipped two-column CSV, e.g.:

```
wikidata-website-qid-2026-03.csv.gz
domain,qid
oaklandzoo.org,Q2008530
museumca.org,Q877714
...
```

This mirrors the format and hosting pattern of `qrank.csv.gz` (from `qrank.toolforge.org`), but hosted on `r2-public.protomaps.com` since this is a derived file we generate ourselves. A dated URL (like `Overture-QRank-2025-12-17.parquet`) makes renders reproducible. It would be regenerated periodically by re-running the QLever P856 query and rebuilding.

The file is generated by `generate-wikidata-website-qid.sh` in the `tiles/` directory. The script fetches the full P856 table from QLever, extracts root domains (the lowercased hostname with a leading `www.` removed), disambiguates via lowest Q-number, and writes the gzipped CSV, producing ~1.4M rows at ~15 MB in under 30 seconds.
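
To make the distribution format concrete, here is a small round-trip sketch that writes and reads a gzipped `domain,qid` CSV. It assumes UTF-8 text and no quoted fields; the class and method names are hypothetical and this is not the code the build uses.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of the two-column gzipped CSV format.
final class DomainQidCsvSketch {
  static void write(Path path, Map<String, String> domainToQid) throws IOException {
    try (Writer w = new OutputStreamWriter(
      new GZIPOutputStream(Files.newOutputStream(path)), StandardCharsets.UTF_8)) {
      w.write("domain,qid\n");
      for (Map.Entry<String, String> e : domainToQid.entrySet()) {
        w.write(e.getKey() + "," + e.getValue() + "\n");
      }
    }
  }

  static Map<String, String> read(Path path) throws IOException {
    Map<String, String> out = new LinkedHashMap<>();
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
      new GZIPInputStream(Files.newInputStream(path)), StandardCharsets.UTF_8))) {
      br.readLine(); // skip the "domain,qid" header
      String line;
      while ((line = br.readLine()) != null) {
        int comma = line.lastIndexOf(',');
        if (comma > 0) {
          out.put(line.substring(0, comma), line.substring(comma + 1));
        }
      }
    }
    return out;
  }

  /** Writes the rows to a temporary file, reads them back, and deletes the file. */
  static Map<String, String> roundTrip(Map<String, String> rows) {
    try {
      Path tmp = Files.createTempFile("wikidata-website-qid-", ".csv.gz");
      try {
        write(tmp, rows);
        return read(tmp);
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(Map.of("oaklandzoo.org", "Q2008530")));
  }
}
```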

### Runtime lookup (two-hop via QRank)

At tile build time the file is downloaded once into the sources directory if it is not already present, then loaded into a `WebsiteQidDb` (a new class modeled on `QrankDb`) backed by a `HashMap<String, Long>` that maps each domain string to a numeric Q-ID.

In `Pois.processOverture`, the website lookup acts as a fallback that fills in a wikidata ID when the feature doesn't have one natively, and then the existing QRank machinery takes over unchanged:

```java
String wikidata = sf.getString("wikidata"); // always null for Overture places theme
if (wikidata == null) {
  String website = /* first entry from sf.getList("websites") */;
  wikidata = websiteQidDb.getQid(website); // domain → "Q2008530"
}
long qrank = (wikidata != null) ? qrankDb.get(wikidata) : 0;
var qrankedZoom = QrankDb.assignZoom(qrankGrading, kind, qrank);

The full lookup chain is:

```
sf.websites[0] → domain → Q-ID (WebsiteQidDb)
Q-ID → qrank score (QrankDb)
qrank score → minZoom (assignZoom)
```

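No changes are needed to `QrankDb` or `assignZoom`: `WebsiteQidDb.getQid` returns the same `Q`-prefixed string form as a native `wikidata` attribute, so the `qrankDb.get(wikidata)` call in the snippet above is reused as-is. A place only benefits if it has a matching website entry *and* a QRank score; otherwise `qrank = 0` and behavior is identical to today.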
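
The first two hops of the chain can be sketched with plain maps standing in for `WebsiteQidDb` and `QrankDb`. This is an illustration under that assumption, with hypothetical names; the final `assignZoom` hop is project code and is left out, and the QRank score used in `main` is made up.

```java
import java.util.Map;

// Hypothetical sketch: plain maps stand in for WebsiteQidDb and QrankDb.
final class TwoHopLookupSketch {
  /** Returns the QRank score for a website domain, or 0 when either hop misses. */
  static long qrankForDomain(String domain, Map<String, Long> domainToQid, Map<Long, Long> qidToQrank) {
    Long qid = domainToQid.get(domain); // hop 1: domain -> numeric Q-ID
    if (qid == null) {
      return 0L;
    }
    return qidToQrank.getOrDefault(qid, 0L); // hop 2: Q-ID -> qrank score
  }

  public static void main(String[] args) {
    Map<String, Long> domainToQid = Map.of("oaklandzoo.org", 2008530L);
    Map<Long, Long> qidToQrank = Map.of(2008530L, 12345L); // made-up score for illustration
    System.out.println(qrankForDomain("oaklandzoo.org", domainToQid, qidToQrank));
  }
}
```

Both misses collapse to `0`, which matches the fallback behavior described above: a place without a website match or without a QRank score is treated exactly as it is today.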
43 changes: 43 additions & 0 deletions tiles/generate-wikidata-website-qid.sh
@@ -0,0 +1,43 @@
#!/bin/bash -ex
set -o pipefail # otherwise a duckdb failure is masked by gzip's exit status in the final pipeline
# Generate wikidata-website-qid.csv.gz -- a mapping from website domain to Wikidata QID.
#
# Fetches the complete Wikidata P856 (official website) table from QLever,
# extracts root domains, disambiguates multiple QIDs per domain by taking the
# lowest Q-number, and writes a gzipped two-column CSV.
#
# Output: data/sources/wikidata-website-qid-YYYY-MM.csv.gz
# Usage: ./generate-wikidata-website-qid.sh

DATE=$(date +%Y-%m)
OUTPUT="data/sources/wikidata-website-qid-${DATE}.csv.gz"
TSV_TMP=$(mktemp /tmp/wikidata-p856-XXXXXX) && mv "$TSV_TMP" "${TSV_TMP}.tsv" && TSV_TMP="${TSV_TMP}.tsv"

echo "Fetching Wikidata P856 (official website) from QLever..."
curl --fail \
  -H "Accept: text/tab-separated-values" \
  --data-urlencode "query=PREFIX wdt: <http://www.wikidata.org/prop/direct/> SELECT ?item ?website WHERE { ?item wdt:P856 ?website }" \
  --data-urlencode "send=2400000" \
  "https://qlever.dev/api/wikidata" \
  -o "$TSV_TMP"

echo "Building domain -> QID mapping..."
duckdb -c "
  COPY (
    SELECT
      regexp_extract(lower(\"?website\"), 'https?://(?:www\\.)?([^/>\?]+)', 1) AS domain,
      arg_min(
        regexp_extract(\"?item\", 'entity/(Q[0-9]+)', 1),
        CAST(regexp_extract(\"?item\", 'Q([0-9]+)', 1) AS INTEGER)
      ) AS qid
    FROM read_csv('${TSV_TMP}', delim='\t', header=true, ignore_errors=true)
    WHERE regexp_extract(\"?item\", 'entity/(Q[0-9]+)', 1) != ''
      AND regexp_extract(lower(\"?website\"), 'https?://(?:www\\.)?([^/>\?]+)', 1) != ''
    GROUP BY domain
    ORDER BY domain
  ) TO '/dev/stdout' (FORMAT CSV, HEADER true)
" | gzip > "$OUTPUT"

rm "$TSV_TMP"

echo "Done: ${OUTPUT} ($(du -sh "$OUTPUT" | cut -f1))"
21 changes: 16 additions & 5 deletions tiles/src/main/java/com/protomaps/basemap/Basemap.java
@@ -8,6 +8,7 @@
 import com.onthegomap.planetiler.util.Downloader;
 import com.protomaps.basemap.feature.CountryCoder;
 import com.protomaps.basemap.feature.QrankDb;
+import com.protomaps.basemap.feature.WebsiteQidDb;
 import com.protomaps.basemap.layers.Boundaries;
 import com.protomaps.basemap.layers.Buildings;
 import com.protomaps.basemap.layers.Earth;
@@ -38,7 +39,7 @@ public class Basemap extends ForwardingProfile {

   private static final Logger LOGGER = LoggerFactory.getLogger(Basemap.class);

-  public Basemap(QrankDb qrankDb, CountryCoder countryCoder, Clip clip,
+  public Basemap(QrankDb qrankDb, WebsiteQidDb websiteQidDb, CountryCoder countryCoder, Clip clip,
     String layer) {

     if (layer.isEmpty() || layer.equals(Boundaries.LAYER_NAME)) {
@@ -78,7 +79,7 @@ public Basemap(QrankDb qrankDb, CountryCoder countryCoder, Clip clip,
     }

     if (layer.isEmpty() || layer.equals(Pois.LAYER_NAME)) {
-      var poi = new Pois(qrankDb);
+      var poi = new Pois(qrankDb, websiteQidDb);
       registerHandler(poi);
       registerSourceHandler("osm", poi::processOsm);
       registerSourceHandler("pm:overture", poi::processOverture);
@@ -206,12 +207,12 @@ public static void main(String[] args) throws IOException {
   }

   private static void printVersion() {
-    Basemap basemap = new Basemap(null, null, null, "");
+    Basemap basemap = new Basemap(null, null, null, null, "");
     System.out.println(basemap.version());
   }

   private static void printHelp() {
-    Basemap basemap = new Basemap(null, null, null, "");
+    Basemap basemap = new Basemap(null, null, null, null, "");
     System.out.println(String.format("""
       %s v%s
       %s
@@ -317,6 +318,16 @@ static void run(Arguments args) throws IOException {

     var qrankDb = QrankDb.fromCsv(qrankCsv);

+    Path websiteQidCsv = sourcesDir.resolve("wikidata-website-qid-2026-03.csv.gz");
+    if (!Files.exists(websiteQidCsv)) {
+      Downloader.create(planetiler.config())
+        .add("wikidata-website-qid",
+          "https://954.teczno.com/~migurski/tmp/wikidata-website-qid.csv.gz",
+          websiteQidCsv)
+        .run();
+    }
+    var websiteQidDb = WebsiteQidDb.fromCsv(websiteQidCsv);
+
     if (!Files.exists(pgfEncodingZip)) {
       Downloader.create(planetiler.config())
         .add("pgf-encoding", "https://wipfli.github.io/pgf-encoding/pgf-encoding.zip", pgfEncodingZip)
@@ -375,7 +386,7 @@ static void run(Arguments args) throws IOException {
       outputName = area;
     }

-    planetiler.setProfile(new Basemap(qrankDb, countryCoder, clip, layer))
+    planetiler.setProfile(new Basemap(qrankDb, websiteQidDb, countryCoder, clip, layer))
       .setOutput(Path.of(outputName + ".pmtiles"))
       .run();
   }
@@ -0,0 +1,79 @@
package com.protomaps.basemap.feature;

import java.io.*;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/**
 * An in-memory mapping from website domain to Wikidata Q-ID, used to enrich Overture POIs (which lack native wikidata
 * fields) for QRank-based zoom assignment.
 * <p>
 * Parses a gzipped CSV with columns {@code domain,qid} into a HashMap for efficient lookup.
 **/
public final class WebsiteQidDb {

  private final Map<String, Long> db;

  public WebsiteQidDb(Map<String, Long> db) {
    this.db = db;
  }

  /**
   * Extracts the root domain from a URL and looks up the corresponding Wikidata Q-ID.
   *
   * @param url a full URL such as "https://www.iflyoak.com/flights"
   * @return a Wikidata Q-ID string like "Q1165584", or null if not found
   */
  public String getQid(String url) {
    if (url == null || url.isEmpty()) {
      return null;
    }
    // Lowercase first: the generator script lowercases URLs before extracting domains,
    // so keys in the CSV are always lowercase.
    String domain = url.toLowerCase(java.util.Locale.ROOT);
    // Strip protocol
    if (domain.startsWith("https://")) {
      domain = domain.substring("https://".length());
    } else if (domain.startsWith("http://")) {
      domain = domain.substring("http://".length());
    }
    // Strip www. prefix
    if (domain.startsWith("www.")) {
      domain = domain.substring("www.".length());
    }
    // Take portion up to first '/' or '?', matching where the generator script's regex stops
    domain = domain.split("[/?]", 2)[0];
    Long id = db.get(domain);
    return id != null ? "Q" + id : null;
  }

  public static WebsiteQidDb fromCsv(Path csvPath) throws IOException {
    // Open the file inside try-with-resources so it is closed even if the gzip header is invalid,
    // and decode as UTF-8 explicitly instead of relying on the platform default charset.
    try (InputStream in = new FileInputStream(csvPath.toFile());
      BufferedReader br = new BufferedReader(
        new InputStreamReader(new GZIPInputStream(in), java.nio.charset.StandardCharsets.UTF_8))) {
      String content;
      Map<String, Long> db = new HashMap<>();
      String header = br.readLine(); // header
      assert "domain,qid".equals(header); // null-safe when the file is empty
      while ((content = br.readLine()) != null) {
        int lastComma = content.lastIndexOf(',');
        if (lastComma < 0) {
          continue;
        }
        String domain = content.substring(0, lastComma);
        String qid = content.substring(lastComma + 1);
        if (qid.startsWith("Q")) {
          qid = qid.substring(1);
        }
        try {
          db.put(domain, Long.parseLong(qid));
        } catch (NumberFormatException e) {
          // skip malformed rows
        }
      }
      return new WebsiteQidDb(db);
    }
  }
}