Place Geocoding

Adds geographic coordinates to 37,000 curated place names

The Rijksmuseum's collection metadata references nearly 37,000 distinct places — production sites, depicted locations, and artist birthplaces spanning five continents and five centuries. Most include geographic authority IDs but no latitude/longitude coordinates. rijksmuseum-mcp+ geocodes these places by cascading through multiple authority databases, enabling proximity search, geographic analysis, and spatial reasoning over the entire collection. Every geocoded coordinate carries a three-tier provenance tag (authority / derived / manual) so its source can be audited.

36,907 distinct places
29,726 geocoded (81%)
5 authority sources
1.4M artwork–place links

The problem

Places appear in four roles across the collection, linking artworks to geography through the vocabulary database's mappings table. But the Rijksmuseum's source data provides coordinates for only a fraction of them. Most arrive as text labels with an authority URI (Wikidata, Getty TGN, or GeoNames) but no latitude or longitude.

How places connect to artworks

Production place: 707,428 links
Depicted place: 310,098 links
Creator birth place: 195,957 links
Creator death place: 180,653 links

A single place like "Amsterdam" may appear in all four roles across thousands of artworks. Production places are harvested from Linked Art production parts and OAI-PMH dcterms:spatial (unified in the DB as one field); depicted places from dc:subject; birth/death places from Linked Art person records.

Cascading authority strategy

Each place in the vocabulary database carries an external_id linking it to one or more authority databases. The geocoding pipeline resolves coordinates by querying these authorities in order of reliability, falling through to the next source when a query returns no result. Places with no authority link are sent by name to the Wikidata and WHG reconciliation APIs, which return fuzzy-matched candidates; a local scoring layer then combines string similarity, geographic type, and coordinate availability to accept, flag for review, or reject each match.
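The fall-through behaviour described above can be sketched as a loop over an ordered list of lookups. This is a minimal illustration, not the pipeline's actual code: the `geocode` function, the source keys, and the in-memory tables standing in for the authority services are all hypothetical.

```python
from typing import Callable, Optional

Coord = tuple[float, float]  # (lat, lon)
Lookup = Callable[[str], Optional[Coord]]


def geocode(place_ids: dict[str, str],
            cascade: list[tuple[str, Lookup]]) -> Optional[tuple[str, Coord]]:
    """Return (source, coord) from the first source in the cascade that answers."""
    for source, lookup in cascade:
        authority_id = place_ids.get(source)
        if authority_id is None:
            continue  # this place carries no ID for that authority
        coord = lookup(authority_id)
        if coord is not None:
            return source, coord  # record which source won, for provenance
    return None


# Hypothetical stand-ins for the real authority services.
tgn_table = {"7006952": (52.37, 4.89)}
wikidata_table = {"Q727": (52.37, 4.89)}
cascade = [("tgn", tgn_table.get), ("wikidata", wikidata_table.get)]

result = geocode({"tgn": "7006952", "wikidata": "Q727"}, cascade)
```

Returning the winning source alongside the coordinate is what allows each place to be attributed to the first source that produced a result.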

Source 1

Getty TGN

SPARQL batch queries against the Getty Thesaurus of Geographic Names. 200 IDs per request.

Source 2

Wikidata

SPARQL queries for P625 (coordinate location). 400 QIDs per batch. Fallback to P159, P131, P276.

Source 3

GeoNames

REST API lookups by GeoNames ID. One-by-one at 1 req/sec.

Source 4

WHG

Fuzzy name matching via World Historical Gazetteer. Resolves historical variants.

Fallback

Hierarchy

Places without coordinates inherit from their nearest geocoded parent via broader_id.

Share of resolution methods, v0.24 build (29,726 geocoded places, each attributed to the first cascade source that produced its coordinates; WHG adds zero in this cycle because all candidates were routed to human review):
Getty TGN 48%
Wikidata 27%
Hierarchy 20%
GeoNames 5%

Authority sources in detail

Each source contributes coordinates with different strengths. The pipeline queries them in order of precision and batch efficiency.

| Source | Method | Places | Geocoded | Strengths |
|---|---|---|---|---|
| Getty TGN | SPARQL endpoint; foaf:focus → wgs84:lat/long | 14,417 | 14,286 (99.1%) | Art-world standard. Curated by Getty. Includes historical place names and hierarchies. Primary source in v0.24; largest cross-reference pool after the Schema.org places dump. |
| Wikidata | SPARQL endpoint; P625 + P159/P131/P276 fallbacks | 8,442 | 8,375 (99.2%) | Fallback properties (headquarters, admin territory, location) resolve institutions and districts that lack P625 directly. |
| GeoNames | REST JSON API; by numeric ID | 1,506 | 1,503 (99.8%) | Good for modern administrative units. Free tier rate-limited (1,000 req/hour). |
| World Historical Gazetteer | Reconciliation API; fuzzy name matching | 3,789 | 0 (review queue) | Resolves historical variants (Batavia→Jakarta, Leyden→Leiden). The v0.24 country-context filter routes every candidate to a human-review CSV; none are auto-merged into the production DB yet. |
| Hierarchy inheritance | Recursive CTE; broader_id → parent coords | 6,060 | 6,060 | Streets, buildings, landmarks inherit city-level coordinates from parent place. 82% of places have a parent. |

Geocoding pipeline phases

The geocoding script (geocode_places.py) runs post-harvest in six sequential phases, each targeting a different category of unresolved places.

Phase 1a

Authority ID resolution

Batch SPARQL queries to Getty TGN and Wikidata using existing authority IDs from the harvest. GeoNames IDs resolved via REST API. Handles ~20,000 places in minutes.

external_id: "http://vocab.getty.edu/tgn/7006952"
→ SPARQL → lat: 52.37, lon: 4.89 (Amsterdam)
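As a rough sketch of the batching step, the snippet below splits a list of TGN IDs into groups of 200 and builds one SPARQL query per group using the foaf:focus → wgs84 lat/long path named in this document. The helper names (`chunked`, `build_tgn_query`) are hypothetical, and the exact query text used by geocode_places.py is not shown in the source, so treat the query shape as an assumption.

```python
from typing import Iterator

TGN_BATCH_SIZE = 200  # batch size stated in the text


def chunked(ids: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_tgn_query(tgn_ids: list[str]) -> str:
    """Build one SPARQL query that asks for coordinates of many TGN IDs at once."""
    values = " ".join(f"tgn:{tgn_id}" for tgn_id in tgn_ids)
    return (
        "PREFIX tgn: <http://vocab.getty.edu/tgn/>\n"
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "PREFIX wgs: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n"
        "SELECT ?place ?lat ?lon WHERE {\n"
        f"  VALUES ?place {{ {values} }}\n"
        "  ?place foaf:focus ?focus .\n"
        "  ?focus wgs:lat ?lat ; wgs:long ?lon .\n"
        "}"
    )


all_ids = [str(7000000 + i) for i in range(450)]  # made-up IDs for illustration
queries = [build_tgn_query(batch) for batch in chunked(all_ids, TGN_BATCH_SIZE)]
```

Sending a few hundred IDs per request keeps the number of round trips low, which is consistent with the claim that roughly 20,000 places resolve in minutes.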

Phase 1b

Wikidata alternative properties

When P625 (coordinates) is missing, follow indirect paths: P159 (headquarters location), P131 (located in admin territory), P276 (location). Catches institutions and districts.

"Rijksmuseum" (Q190804) — no P625
→ P131 → Amsterdam → 52.37°N, 4.89°E
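A sketch of the fallback selection, assuming the SPARQL results have already been collected into a mapping from property ID to coordinate literal. Wikidata serialises coordinates as WKT points with longitude first, so the parser swaps them into (lat, lon). The priority among the three fallback properties is an assumption taken from the order in which the text lists them; `pick_coordinate` and `parse_wkt_point` are hypothetical names.

```python
import re
from typing import Optional

# Direct coordinates first, then the indirect paths in the order the text lists them.
FALLBACK_ORDER = ["P625", "P159", "P131", "P276"]

_WKT_POINT = re.compile(
    r"Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)", re.IGNORECASE
)


def parse_wkt_point(wkt: str) -> tuple[float, float]:
    """Convert 'Point(lon lat)' into a (lat, lon) tuple."""
    match = _WKT_POINT.fullmatch(wkt.strip())
    if match is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon


def pick_coordinate(candidates: dict[str, str]) -> Optional[tuple[str, tuple[float, float]]]:
    """Return (property, (lat, lon)) for the highest-priority property present."""
    for prop in FALLBACK_ORDER:
        wkt = candidates.get(prop)
        if wkt:
            return prop, parse_wkt_point(wkt)
    return None


# An entity with no P625 of its own, resolved through its P131 parent.
via_parent = pick_coordinate({"P131": "Point(4.89 52.37)"})
```

Keeping the property ID in the result makes it possible to distinguish a direct coordinate from one borrowed from a headquarters or administrative parent.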

Phase 1c

Cross-reference bridging

Map Getty TGN IDs to Wikidata via P1667 (TGN ID property), then query Wikidata for coordinates. Recovers places where TGN itself lacks coordinates but Wikidata has them.

TGN 7011405 → P1667 → Q727 (Amsterdam)
→ P625 → 52.37°N, 4.89°E
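One way to express the bridge is a single Wikidata query that matches items by their P1667 value and returns P625 in the same pattern. The sketch below only builds the query string; the helper names are hypothetical and the actual query used by the pipeline may differ.

```python
def tgn_id_from_uri(uri: str) -> str:
    """Extract the numeric TGN ID from a Getty vocabulary URI."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def build_tgn_bridge_query(tgn_ids: list[str]) -> str:
    """Find Wikidata items by TGN ID (P1667) and fetch their coordinates (P625)."""
    values = " ".join(f'"{tgn_id}"' for tgn_id in tgn_ids)
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?tgn ?item ?coord WHERE {\n"
        f"  VALUES ?tgn {{ {values} }}\n"
        "  ?item wdt:P1667 ?tgn ;\n"
        "        wdt:P625 ?coord .\n"
        "}"
    )


tgn_id = tgn_id_from_uri("http://vocab.getty.edu/tgn/7006952")
query = build_tgn_bridge_query([tgn_id])
```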

Phase 2

Self-reference resolution

Some vocabulary entries reference other entries that have already been geocoded. Copy coordinates from the resolved entry. Handles aliases and duplicate records.

"'s-Gravenhage" → references "Den Haag"
→ copy lat: 52.08, lon: 4.30
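A small sketch of the copy step, assuming each vocabulary entry may point at another entry through a field here called `refers_to` (a hypothetical name; the real schema is not shown in the source). The hop limit and the visited set guard against reference cycles.

```python
from typing import Optional


def resolve_via_reference(place_id: str, places: dict[str, dict],
                          max_hops: int = 5) -> Optional[tuple[float, float]]:
    """Follow references until an entry with coordinates is found."""
    seen: set[str] = set()
    current: Optional[str] = place_id
    while current is not None and current not in seen and len(seen) <= max_hops:
        seen.add(current)
        entry = places.get(current)
        if entry is None:
            return None
        if entry.get("lat") is not None and entry.get("lon") is not None:
            return entry["lat"], entry["lon"]
        current = entry.get("refers_to")  # hypothetical field name
    return None


places = {
    "s-gravenhage": {"label": "'s-Gravenhage", "refers_to": "den-haag"},
    "den-haag": {"label": "Den Haag", "lat": 52.08, "lon": 4.30},
    # Two entries that only reference each other never resolve.
    "loop-a": {"refers_to": "loop-b"},
    "loop-b": {"refers_to": "loop-a"},
}

coords = resolve_via_reference("s-gravenhage", places)
```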

Phase 3

Fuzzy name reconciliation

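Places with no authority ID are matched by name against Wikidata and the World Historical Gazetteer. Confidence scoring: auto-accept at ≥0.85, flag for review from 0.50 up to (but not including) 0.85, reject below 0.50. In the v0.24 build, every WHG candidate is routed to human review rather than auto-merged.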

"Amstelledamme" → WHG fuzzy → Amsterdam
score: 0.92 → auto-accept
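The thresholds can be read as a three-way classifier over a combined score. The sketch below uses the cut-offs from the text; the scoring function is only a guess at how string similarity, geographic type, and coordinate availability might be combined. The weights and the use of `difflib.SequenceMatcher` are assumptions, not the pipeline's actual formula.

```python
from difflib import SequenceMatcher

AUTO_ACCEPT = 0.85   # threshold from the text
REVIEW_FLOOR = 0.50  # threshold from the text


def classify(score: float) -> str:
    """Map a confidence score to accept / review / reject."""
    if score >= AUTO_ACCEPT:
        return "accept"
    if score >= REVIEW_FLOOR:
        return "review"
    return "reject"


def score_candidate(query: str, candidate_name: str,
                    is_place_type: bool, has_coords: bool) -> float:
    """Hypothetical weighting of the three signals the text mentions."""
    similarity = SequenceMatcher(None, query.casefold(), candidate_name.casefold()).ratio()
    return 0.7 * similarity + 0.2 * is_place_type + 0.1 * has_coords


decision = classify(0.92)  # the score from the example above
```

Making the lower bound of each band inclusive removes the ambiguity about which band a score of exactly 0.85 or 0.50 falls into.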

Phase 4

Validation

Systematic checks for hemisphere errors, null island false positives (0°N, 0°E), latitude/longitude swaps, and authority misidentifications detected via production-place vs depicted-place distance outliers.

Tewkesbury → TGN 7821058 (Tasmania!)
→ flagged: 16,000 km from other English works
→ corrected to Gloucestershire
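Some of these checks are simple enough to sketch directly. The function below flags null-island coordinates and, when an expected bounding box is available (for example from a parent region), tests whether swapping the axes or flipping a sign would bring the point back inside it. The function name, the issue labels, and the bounding-box approach are illustrative assumptions; the distance-outlier check is not covered here.

```python
from typing import Optional

BBox = tuple[float, float, float, float]  # (min_lat, min_lon, max_lat, max_lon)


def _inside(lat: float, lon: float, bbox: BBox) -> bool:
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def check_coordinate(lat: float, lon: float,
                     expected_bbox: Optional[BBox] = None) -> list[str]:
    """Return a list of suspected problems; an empty list means no flags."""
    if lat == 0.0 and lon == 0.0:
        return ["null_island"]
    issues: list[str] = []
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        issues.append("out_of_range")
    if expected_bbox is not None and not _inside(lat, lon, expected_bbox):
        if _inside(lon, lat, expected_bbox):
            issues.append("possible_lat_lon_swap")
        elif _inside(-lat, lon, expected_bbox) or _inside(lat, -lon, expected_bbox):
            issues.append("possible_hemisphere_error")
        else:
            issues.append("outside_expected_region")
    return issues


# Rough, illustrative bounding box around the Netherlands.
NL_BBOX: BBox = (50.7, 3.3, 53.6, 7.3)
swapped = check_coordinate(4.89, 52.37, NL_BBOX)
```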

Place hierarchy and coordinate inheritance

The vocabulary database links places into a parent–child tree via broader_id, derived from the Rijksmuseum's place hierarchy (P89_falls_within in CIDOC-CRM). 82% of places have a parent. This hierarchy serves two purposes.

Coordinate inheritance

Streets, buildings, and landmarks that no authority can geocode directly inherit coordinates from their nearest geocoded ancestor. Precision drops to city level, but coverage rises: in v0.24, direct authority lookups cover about 64% of places, and hierarchy inheritance lifts the overall total to 81%.

"Spin- en Nieuwe Werkhuis"
→ no coordinates in any authority
→ broader_id → Amsterdam
→ inherits 52.37°N, 4.89°E
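The inheritance lookup can be sketched as a recursive CTE that walks up broader_id and returns the closest ancestor with coordinates. The table layout, the depth cap, and the Netherlands coordinates below are illustrative assumptions; only the broader_id column name and the Amsterdam example come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE places (id INTEGER PRIMARY KEY, label TEXT,
                     broader_id INTEGER, lat REAL, lon REAL);
INSERT INTO places VALUES
  (1, 'Netherlands', NULL, 52.1, 5.3),          -- illustrative coordinates
  (2, 'Amsterdam', 1, 52.37, 4.89),
  (3, 'Spin- en Nieuwe Werkhuis', 2, NULL, NULL),
  (4, 'Unplaced entry', NULL, NULL, NULL);
""")

NEAREST_ANCESTOR_SQL = """
WITH RECURSIVE chain(id, broader_id, lat, lon, depth) AS (
  SELECT id, broader_id, lat, lon, 0 FROM places WHERE id = :place_id
  UNION ALL
  SELECT p.id, p.broader_id, p.lat, p.lon, chain.depth + 1
  FROM places AS p JOIN chain ON p.id = chain.broader_id
  WHERE chain.depth < :max_depth
)
SELECT id, lat, lon, depth FROM chain
WHERE lat IS NOT NULL AND lon IS NOT NULL
ORDER BY depth
LIMIT 1
"""


def nearest_geocoded_ancestor(place_id: int, max_depth: int = 10):
    """Return (ancestor_id, lat, lon, depth), or None if nothing up the tree is geocoded."""
    params = {"place_id": place_id, "max_depth": max_depth}
    return conn.execute(NEAREST_ANCESTOR_SQL, params).fetchone()


inherited = nearest_geocoded_ancestor(3)
```

A depth of 0 means the place has its own coordinates, so a positive depth identifies inherited values that belong in the derived provenance tier.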

Hierarchy expansion at query time

The expandPlaceHierarchy flag on search_artwork recursively expands a place to include all descendants — so "Netherlands" also returns artworks produced in Amsterdam, Delft, Haarlem, and every other Dutch location.

search_artwork(productionPlace: "Netherlands", expandPlaceHierarchy: true)
→ recursive CTE walks broader_id tree (max depth 10, limit 10K descendants)
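A minimal sketch of the descendant walk, using the depth and row limits quoted above. The table layout and the `expand_place` function are hypothetical; the real query behind expandPlaceHierarchy is not shown in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE places (id INTEGER PRIMARY KEY, label TEXT, broader_id INTEGER);
INSERT INTO places VALUES
  (1, 'Netherlands', NULL),
  (2, 'Amsterdam', 1),
  (3, 'Delft', 1),
  (4, 'Oude Kerk', 2),
  (5, 'Belgium', NULL);
""")

EXPAND_SQL = """
WITH RECURSIVE subtree(id, label, depth) AS (
  SELECT id, label, 0 FROM places WHERE label = :root
  UNION ALL
  SELECT p.id, p.label, subtree.depth + 1
  FROM places AS p JOIN subtree ON p.broader_id = subtree.id
  WHERE subtree.depth < :max_depth
)
SELECT label FROM subtree ORDER BY depth, label LIMIT :max_rows
"""


def expand_place(root: str, max_depth: int = 10, max_rows: int = 10_000) -> list[str]:
    """Return the root place plus all descendants reachable within max_depth levels."""
    params = {"root": root, "max_depth": max_depth, "max_rows": max_rows}
    return [label for (label,) in conn.execute(EXPAND_SQL, params)]


dutch_places = expand_place("Netherlands")
```

The depth cap also protects against accidental cycles in broader_id, which would otherwise make the recursion run without end.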

Resolving historical place names

A collection spanning the 15th to 21st century uses place names from many eras. The World Historical Gazetteer is particularly valuable here — its datasets include Dutch colonial names, VOC-era toponyms, and medieval variants that modern gazetteers do not cover.

| Historical name | Modern name | Resolved via | Context |
|---|---|---|---|
| Batavia | Jakarta | WHG | VOC headquarters, 1619–1942 |
| Leyden | Leiden | WHG | Historical English/Latin spelling |
| 's-Gravenhage | Den Haag / The Hague | Wikidata | Formal Dutch name |
| Cochin | Kochi | WHG | VOC trading post, Kerala |
| Elmina | Elmina | WHG | WIC fort, Gold Coast (Ghana) |
| Amstelledamme | Amsterdam | WHG | Medieval Dutch |

What geocoding enables at runtime

With coordinates in the database, rijksmuseum-mcp+ supports spatial queries that would be impossible with place names alone.

Proximity search

Find artworks produced in, depicting, or connected to places within a radius of a named location. Uses a custom Haversine distance function in SQLite with a bounding-box pre-filter for performance.

search_artwork(nearPlace: "Oude Kerk Amsterdam", nearPlaceRadius: 0.5)
→ artworks within 500m of the Oude Kerk
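The combination of a bounding-box pre-filter and an exact Haversine check can be sketched with Python's sqlite3 module, which lets a Python function be registered as an SQL function. The table layout and function names are assumptions; the server's own implementation is not shown in the source. The box calculation ignores the poles and the antimeridian for brevity.

```python
import math
import sqlite3

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def bounding_box(lat: float, lon: float, radius_km: float):
    """Cheap rectangular pre-filter that contains the search circle."""
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    dlon = math.degrees(radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


conn = sqlite3.connect(":memory:")
conn.create_function("haversine_km", 4, haversine_km)
conn.executescript("""
CREATE TABLE places (label TEXT, lat REAL, lon REAL);
INSERT INTO places VALUES ('Amsterdam', 52.37, 4.89), ('Den Haag', 52.08, 4.30);
""")


def places_near(lat: float, lon: float, radius_km: float) -> list[str]:
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_km)
    rows = conn.execute(
        """
        SELECT label FROM places
        WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?  -- cheap range test
          AND haversine_km(?, ?, lat, lon) <= ?            -- exact check
        ORDER BY label
        """,
        (min_lat, max_lat, min_lon, max_lon, lat, lon, radius_km),
    ).fetchall()
    return [label for (label,) in rows]


nearby = places_near(52.37, 4.89, 25)
```

The range comparisons can be served by ordinary indexes on lat and lon, so the trigonometric function only needs to decide the rows that fall inside the box.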

Multi-word place resolution

Ambiguous queries like "Paleis van Justitie Den Haag" are resolved via progressive token splitting: try the full string, then progressively drop right-side tokens, using the dropped tokens as geographic context for disambiguation.

"Rapenburg" → 2 candidates
(Leiden: 52.16°N & Amsterdam: 52.37°N)
+ context "Leiden" → picks correct one
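Progressive token splitting can be sketched as a loop that shortens the name from the right and treats whatever was cut off as context. The gazetteer structure, the `parent` field, and the rule for ambiguous matches are all assumptions made for illustration.

```python
from typing import Optional


def resolve_place(query: str, gazetteer: dict[str, list[dict]]) -> Optional[dict]:
    """Try the full string first, then drop right-side tokens and use them as context."""
    tokens = query.split()
    for cut in range(len(tokens), 0, -1):
        name = " ".join(tokens[:cut]).casefold()
        context = " ".join(tokens[cut:]).casefold()
        candidates = gazetteer.get(name, [])
        if not candidates:
            continue
        if context:
            # Prefer the candidate whose parent place matches the dropped tokens.
            for candidate in candidates:
                if candidate["parent"].casefold() == context:
                    return candidate
        if len(candidates) == 1:
            return candidates[0]
        # Several candidates and no usable context: keep shortening.
    return None


gazetteer = {
    "rapenburg": [
        {"label": "Rapenburg", "parent": "Leiden", "lat": 52.16},
        {"label": "Rapenburg", "parent": "Amsterdam", "lat": 52.37},
    ],
    "paleis van justitie": [
        {"label": "Paleis van Justitie", "parent": "Den Haag", "lat": 52.08},
    ],
}

match = resolve_place("Rapenburg Leiden", gazetteer)
```

In this sketch an ambiguous name with no context returns nothing rather than an arbitrary candidate, which is one possible policy among several.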

Geographic distance as a signal

Proximity queries double as a data quality tool: outlier distances between an artwork's production place and depicted place can surface authority misidentifications that would otherwise be invisible in text-only metadata.

SK-A-XXXX: produced in London,
depicts "Tewkesbury" → geocoded to Tasmania
→ 16,000 km distance flags the error

Configurable radius

Radii from 100 metres to 500 km. Useful for both fine-grained queries ("artworks from this street") and regional surveys ("production in the Rhine valley").

nearPlaceRadius: 0.1 → 100m (a building)
nearPlaceRadius: 25 → 25km (a city region)
nearPlaceRadius: 200 → 200km (a province)

Building coverage: from 0% to 81%

Each geocoded coordinate in v0.24 is tagged with a three-tier provenance label: authority (resolved directly against an external gazetteer), derived (inherited from a parent or alias within the vocabulary tree), or manual (hand-curated overrides). Authority lookups do most of the work; hierarchy inheritance closes most of the tail.

Total places: 36,907
Authority (external ID): 23,666 (64%)
Derived (hierarchy + alias): 6,060 (16%)
Manual (curated): 0
Total geocoded: 29,726 (81%)
Remaining ungeocoded: 7,181 (19%)
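The figures above can be cross-checked with a few lines of arithmetic. The variable names are arbitrary; the counts are the ones reported in this section.

```python
total_places = 36_907
authority = 23_666
derived = 6_060
manual = 0

geocoded = authority + derived + manual
remaining = total_places - geocoded


def pct(count: int) -> int:
    """Share of all places, rounded to a whole percent."""
    return round(100 * count / total_places)


summary = {
    "authority": pct(authority),
    "derived": pct(derived),
    "geocoded": pct(geocoded),
    "remaining": pct(remaining),
}
```

The authority and derived shares round to 64% and 16%, which sum to 80% rather than 81%; the difference is rounding, since the unrounded total is about 80.5%.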

Authority tier, by gazetteer. Among the 23,666 authority-tier places, external-ID cross-references point to Getty TGN (14,286 places), Wikidata (8,375), and GeoNames (1,503). Counts overlap: many places carry IDs from more than one authority, and the cascade accepts whichever returns coordinates first. WHG contributions are resolved inline during the fuzzy-name phase and are not persisted as cross-references, so they do not appear in this breakdown.

The remaining 7,181 ungeocoded places are mostly orphan vocabulary entries not linked to any artwork, local landmarks outside every gazetteer's coverage, and entries whose authority IDs return no coordinates in any queried source.