Place Geocoding

Adding coordinates to 32,000 place names from five centuries of art history

The Rijksmuseum's collection metadata references over 32,000 distinct places — production sites, depicted locations, and artist birthplaces spanning five continents and five centuries. Most arrive as bare names without coordinates. rijksmuseum-mcp+ geocodes these places by cascading through multiple authority databases, enabling proximity search, geographic analysis, and spatial reasoning over the entire collection.

32,347 distinct places
31,033 geocoded (96%)
5 authority sources
485K artwork–place links

The problem

Places appear in four roles across the collection, linking artworks to geography through the vocabulary database's mappings table. But the Rijksmuseum's source data provides coordinates for only a fraction of them. Most arrive as text labels with an authority URI (Wikidata, Getty TGN, or GeoNames) but no latitude or longitude.

How places connect to artworks

Production place: 485,429 links
Depicted place: 309,502 links
Creator birth place: 196,234 links
Creator death place: 180,185 links

A single place like "Amsterdam" may appear in all four roles across thousands of artworks. Production places come from OAI-PMH dcterms:spatial; depicted places from dc:subject; birth/death places from Linked Art person records.

Cascading authority strategy

Each place in the vocabulary database carries an external_id linking it to one or more authority databases. The geocoding pipeline resolves coordinates by querying these authorities in order of reliability, falling through to the next source when a query returns no result. Places with no authority link are sent by name to the Wikidata and WHG reconciliation APIs, which return fuzzy-matched candidates; a local scoring layer then combines string similarity, geographic type, and coordinate availability to accept, flag for review, or reject each match.

Source 1

Getty TGN

SPARQL batch queries against the Getty Thesaurus of Geographic Names. 200 IDs per request.

Source 2

Wikidata

SPARQL queries for P625 (coordinate location). 400 QIDs per batch. Fallback to P159, P131, P276.

Source 3

GeoNames

REST API lookups by GeoNames ID. One-by-one at 1 req/sec.

Source 4

WHG

Fuzzy name matching via World Historical Gazetteer. Resolves historical variants.

Fallback

Hierarchy

Places without coordinates inherit from their nearest geocoded parent via broader_id.
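The fall-through behaviour described above can be sketched as an ordered list of resolver functions. This is a minimal illustration, not the pipeline's actual code: the resolver names, the `place` dictionary shape, and the stub lookups are all assumptions.

```python
from typing import Callable, Optional

Coord = tuple[float, float]  # (lat, lon)
Resolver = Callable[[dict], Optional[Coord]]


def resolve_with_cascade(
    place: dict, resolvers: list[tuple[str, Resolver]]
) -> Optional[tuple[str, Coord]]:
    """Try each resolver in order and return (source name, coords) for the first hit."""
    for name, resolver in resolvers:
        coord = resolver(place)
        if coord is not None:
            return name, coord
    return None  # nothing matched; the place stays ungeocoded for now


# Hypothetical stubs standing in for the real network lookups.
def from_tgn(place: dict) -> Optional[Coord]:
    return None  # pretend TGN has no coordinates for this place


def from_wikidata(place: dict) -> Optional[Coord]:
    if place.get("external_id", "").endswith("/Q727"):
        return (52.37, 4.89)
    return None


CASCADE = [("tgn", from_tgn), ("wikidata", from_wikidata)]
amsterdam = {"label": "Amsterdam", "external_id": "http://www.wikidata.org/entity/Q727"}
result = resolve_with_cascade(amsterdam, CASCADE)
```

Keeping the order in a plain list makes the reliability ranking explicit and easy to change without touching the resolvers themselves.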

Chart: coordinate sources across 31,033 geocoded places
Wikidata 33%
Hierarchy 31%
Getty TGN 29%
GeoNames and WHG: remaining share (not labelled in the chart)

Authority sources in detail

Each source contributes coordinates with different strengths. The pipeline queries them in order of precision and batch efficiency.

Source | Method | Places | Geocoded | Strengths
Getty TGN | SPARQL endpoint; foaf:focus → wgs84:lat/long | 8,884 | 8,882 (99.98%) | Art-world standard. Curated by Getty. Includes historical place names and hierarchies.
Wikidata | SPARQL endpoint; P625 + P159/P131/P276 fallbacks | 10,308 | 10,264 (99.6%) | Largest coverage. Fallback properties (headquarters, admin territory) resolve institutions and districts.
GeoNames | REST JSON API; lookup by numeric ID | 1,215 | 1,201 (98.8%) | Good for modern administrative units. Free tier rate-limited (1,000 req/hour).
World Historical Gazetteer | Reconciliation API; fuzzy name matching | ~2,283 | ~2,283 (accepted) | Resolves historical variants (Batavia→Jakarta, Leyden→Leiden). Bridges to other authorities.
Hierarchy inheritance | Recursive CTE; broader_id → parent coords | ~9,460 | ~9,460 | Streets, buildings, landmarks inherit city-level coordinates from parent place. 79% of places have a parent.

Geocoding pipeline phases

The geocoding script (geocode_places.py) runs post-harvest in six sequential phases, each targeting a different category of unresolved places.

Phase 1a

Authority ID resolution

Batch SPARQL queries to Getty TGN and Wikidata using existing authority IDs from the harvest. GeoNames IDs resolved via REST API. Handles ~20,000 places in minutes.

external_id: "http://vocab.getty.edu/tgn/7006952"
→ SPARQL → lat: 52.37, lon: 4.89 (Amsterdam)
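As a rough sketch of the batching step for the Wikidata side, the snippet below chunks QIDs, builds a `VALUES`-based SPARQL query for P625, and parses the WKT literal that the Wikidata Query Service returns (`Point(lon lat)`, longitude first). The function names are illustrative, and the HTTP request itself is left out.

```python
import re
from typing import Iterator, Optional


def chunked(ids: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_p625_query(qids: list[str]) -> str:
    """Build one SPARQL query asking for P625 on a batch of QIDs."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?coord WHERE { "
        f"VALUES ?item {{ {values} }} "
        "?item wdt:P625 ?coord . }"
    )


_WKT_POINT = re.compile(
    r"Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)", re.IGNORECASE
)


def parse_wkt_point(wkt: str) -> Optional[tuple[float, float]]:
    """Return (lat, lon) from a WKT literal written as 'Point(lon lat)'."""
    match = _WKT_POINT.fullmatch(wkt.strip())
    if match is None:
        return None
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon


# 400 QIDs per batch, as stated for the Wikidata source.
batches = list(chunked([f"Q{n}" for n in range(1, 1001)], 400))
```

Note the order swap in `parse_wkt_point`: forgetting it is one way to produce the latitude/longitude swaps that Phase 4 later has to catch.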
Phase 1b

Wikidata alternative properties

When P625 (coordinates) is missing, follow indirect paths: P159 (headquarters location), P131 (located in admin territory), P276 (location). Catches institutions and districts.

"Rijksmuseum" (Q190804) — no P625
→ P131 → Amsterdam → 52.37°N, 4.89°E
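One way to choose among several indirect paths is to rank the fallback properties and keep the best-ranked hit per item. The sketch below assumes the priority follows the order in which the properties are listed (P159, then P131, then P276); the source does not confirm that ranking, and the row format is hypothetical.

```python
Coord = tuple[float, float]  # (lat, lon)

# Assumed priority: earlier entries win when several properties resolve.
FALLBACK_ORDER = ["P159", "P131", "P276"]


def pick_fallback(rows: list[tuple[str, str, Coord]]) -> dict[str, tuple[str, Coord]]:
    """From (qid, property, coords-of-target) rows, keep the best-ranked row per QID."""
    best: dict[str, tuple[int, str, Coord]] = {}
    for qid, prop, coord in rows:
        if prop not in FALLBACK_ORDER:
            continue  # ignore properties outside the fallback list
        rank = FALLBACK_ORDER.index(prop)
        if qid not in best or rank < best[qid][0]:
            best[qid] = (rank, prop, coord)
    return {qid: (prop, coord) for qid, (_, prop, coord) in best.items()}


rows = [
    ("Q190804", "P276", (52.36, 4.88)),
    ("Q190804", "P131", (52.37, 4.89)),
]
chosen = pick_fallback(rows)
```

Recording which property supplied the coordinates keeps the lower precision of an inherited location visible downstream.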
Phase 1c

Cross-reference bridging

Map Getty TGN IDs to Wikidata via P1667 (TGN ID property), then query Wikidata for coordinates. Recovers places where TGN itself lacks coordinates but Wikidata has them.

TGN 7011405 → P1667 → Q727 (Amsterdam)
→ P625 → 52.37°N, 4.89°E
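The bridging step could look roughly like this: extract the numeric TGN ID from the stored URI, then ask Wikidata which item carries that ID in P1667 and what its P625 value is. The helper names are illustrative, and the exact query the pipeline sends is an assumption.

```python
from typing import Optional

TGN_PREFIX = "http://vocab.getty.edu/tgn/"


def tgn_id_from_uri(uri: str) -> Optional[str]:
    """Return the numeric TGN ID from a Getty URI, or None if it is not one."""
    if not uri.startswith(TGN_PREFIX):
        return None
    tail = uri[len(TGN_PREFIX):]
    return tail if tail.isdigit() else None


def build_bridge_query(tgn_ids: list[str]) -> str:
    """Build a SPARQL query mapping TGN IDs to Wikidata items with coordinates."""
    # P1667 values are plain string literals, so each ID is quoted.
    values = " ".join(f'"{tgn_id}"' for tgn_id in tgn_ids if tgn_id.isdigit())
    return (
        "SELECT ?tgn ?item ?coord WHERE { "
        f"VALUES ?tgn {{ {values} }} "
        "?item wdt:P1667 ?tgn ; wdt:P625 ?coord . }"
    )


tgn_id = tgn_id_from_uri("http://vocab.getty.edu/tgn/7006952")
query = build_bridge_query([tgn_id])
```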
Phase 2

Self-reference resolution

Some vocabulary entries reference other entries that have already been geocoded. Copy coordinates from the resolved entry. Handles aliases and duplicate records.

"'s-Gravenhage" → references "Den Haag"
→ copy lat: 52.08, lon: 4.30
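A minimal sketch of self-reference resolution, assuming each entry may carry a `same_as` key pointing at another entry's ID. That field name is hypothetical; the source does not say how the reference is stored. The loop follows chains of references and stops on cycles.

```python
def resolve_self_references(places: dict[str, dict]) -> int:
    """Copy coordinates from referenced entries; return how many places were filled."""
    filled = 0
    for place_id, place in places.items():
        if place.get("lat") is not None:
            continue  # already geocoded
        seen = {place_id}
        target_id = place.get("same_as")
        while target_id is not None and target_id not in seen:
            seen.add(target_id)
            target = places.get(target_id)
            if target is None:
                break  # dangling reference
            if target.get("lat") is not None:
                place["lat"], place["lon"] = target["lat"], target["lon"]
                filled += 1
                break
            target_id = target.get("same_as")  # follow the chain one step further
    return filled


places = {
    "s-gravenhage": {"label": "'s-Gravenhage", "same_as": "den-haag"},
    "den-haag": {"label": "Den Haag", "lat": 52.08, "lon": 4.30},
    "loop-a": {"label": "A", "same_as": "loop-b"},
    "loop-b": {"label": "B", "same_as": "loop-a"},
}
filled = resolve_self_references(places)
```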
Phase 3

Fuzzy name reconciliation

Places with no authority ID are matched by name against Wikidata and the World Historical Gazetteer. Each candidate receives a confidence score: 0.85 or higher is auto-accepted, 0.50 up to (but not including) 0.85 is flagged for review, and anything below 0.50 is rejected.

"Amstelledamme" → WHG fuzzy → Amsterdam
score: 0.92 → auto-accept
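The accept/review/reject thresholds can be expressed directly in code. The sketch below also combines the three signals named earlier (string similarity, geographic type, coordinate availability) into one score, but the weights and the use of `difflib` for similarity are assumptions made for illustration; only the thresholds come from the text.

```python
from difflib import SequenceMatcher

# Hypothetical weights; the real scoring layer's weighting is not documented here.
W_NAME, W_TYPE, W_COORDS = 0.7, 0.2, 0.1


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two names, in [0, 1]."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def combined_score(query: str, candidate: str, type_matches: bool, has_coords: bool) -> float:
    """Weighted sum of name similarity, type agreement, and coordinate availability."""
    return (
        W_NAME * name_similarity(query, candidate)
        + W_TYPE * float(type_matches)
        + W_COORDS * float(has_coords)
    )


def classify(score: float) -> str:
    """Apply the documented thresholds: >=0.85 accept, >=0.50 review, else reject."""
    if score >= 0.85:
        return "accept"
    if score >= 0.50:
        return "review"
    return "reject"


decision = classify(0.92)
```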
Phase 4

Validation

Systematic checks for hemisphere errors, null island false positives (0°N, 0°E), latitude/longitude swaps, and authority misidentifications detected via production-place vs depicted-place distance outliers.

Tewkesbury → TGN 7821058 (Tasmania!)
→ flagged: 16,000 km from other English works
→ corrected to Gloucestershire
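The point-level checks can be sketched as a function that returns a list of issue labels. The optional `expected_bbox` argument (south, west, north, east) is a hypothetical way to express "where this place should be"; the source does not describe how the pipeline encodes that expectation, and the Netherlands box below is approximate.

```python
from typing import Optional

BBox = tuple[float, float, float, float]  # (south, west, north, east)


def validate_coord(lat: float, lon: float, expected_bbox: Optional[BBox] = None) -> list[str]:
    """Return issue labels for a coordinate pair; an empty list means no issue found."""
    issues = []
    if not (-90 <= lat <= 90) or not (-180 <= lon <= 180):
        issues.append("out_of_range")
    if lat == 0 and lon == 0:
        issues.append("null_island")
    if expected_bbox is not None:
        south, west, north, east = expected_bbox
        inside = south <= lat <= north and west <= lon <= east
        if not inside:
            if south <= lon <= north and west <= lat <= east:
                issues.append("possible_lat_lon_swap")
            elif south <= -lat <= north and west <= lon <= east:
                issues.append("possible_latitude_sign_error")
            else:
                issues.append("outside_expected_region")
    return issues


NL_BBOX = (50.7, 3.3, 53.6, 7.3)  # rough box around the Netherlands, for illustration
swapped = validate_coord(4.89, 52.37, NL_BBOX)
```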

Place hierarchy and coordinate inheritance

The vocabulary database links places into a parent–child tree via broader_id, derived from the Rijksmuseum's place hierarchy (P89_falls_within in CIDOC-CRM). 79% of places have a parent. This hierarchy serves two purposes.

Coordinate inheritance

Streets, buildings, and landmarks that no authority can geocode directly inherit coordinates from their nearest geocoded ancestor. Precision drops to city-level, but coverage jumps from 64% to 96%.

"Spin- en Nieuwe Werkhuis"
→ no coordinates in any authority
→ broader_id → Amsterdam
→ inherits 52.37°N, 4.89°E
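Inheritance from the nearest geocoded ancestor maps naturally onto a recursive CTE. The sketch below runs against an in-memory SQLite database; the `places` table layout is an assumption, as are the sample rows and the depth cap of 10 (borrowed from the query-time expansion limit).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE places (
    id INTEGER PRIMARY KEY, label TEXT,
    lat REAL, lon REAL, broader_id INTEGER
);
INSERT INTO places VALUES
    (1, 'Netherlands', 52.2, 5.3, NULL),
    (2, 'Amsterdam', 52.37, 4.89, 1),
    (3, 'Spin- en Nieuwe Werkhuis', NULL, NULL, 2),
    (4, 'Unnamed room', NULL, NULL, 3);
""")

# Walk upward from each ungeocoded place; recursion continues only while the
# current ancestor also lacks coordinates, so the first hit is the nearest one.
INHERIT_SQL = """
WITH RECURSIVE ancestors(place_id, ancestor_id, depth) AS (
    SELECT id, broader_id, 1 FROM places
    WHERE lat IS NULL AND broader_id IS NOT NULL
  UNION ALL
    SELECT a.place_id, p.broader_id, a.depth + 1
    FROM ancestors a JOIN places p ON p.id = a.ancestor_id
    WHERE p.lat IS NULL AND p.broader_id IS NOT NULL AND a.depth < 10
)
SELECT a.place_id, p.lat, p.lon, a.depth
FROM ancestors a JOIN places p ON p.id = a.ancestor_id
WHERE p.lat IS NOT NULL
ORDER BY a.place_id
"""

inherited = conn.execute(INHERIT_SQL).fetchall()
conn.executemany(
    "UPDATE places SET lat = ?, lon = ? WHERE id = ?",
    [(lat, lon, place_id) for place_id, lat, lon, _depth in inherited],
)
conn.commit()
```

Storing the depth alongside the inherited coordinates would let later queries distinguish a directly geocoded place from one that only has city-level precision.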

Hierarchy expansion at query time

The expandPlaceHierarchy flag on search_artwork recursively expands a place to include all descendants — so "Netherlands" also returns artworks produced in Amsterdam, Delft, Haarlem, and every other Dutch location.

search_artwork(productionPlace: "Netherlands", expandPlaceHierarchy: true)
→ recursive CTE walks broader_id tree (max depth 10, limit 10K descendants)
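The downward walk could be written as follows, again against a hypothetical `places` table in SQLite. The depth cap and descendant limit mirror the values quoted above; the function name `expand_place` and the sample rows are illustrative.

```python
import sqlite3

DESCENDANTS_SQL = """
WITH RECURSIVE tree(id, depth) AS (
    SELECT id, 0 FROM places WHERE label = :label
  UNION
    SELECT p.id, t.depth + 1
    FROM places p JOIN tree t ON p.broader_id = t.id
    WHERE t.depth < :max_depth
)
SELECT DISTINCT id FROM tree LIMIT :limit
"""


def expand_place(conn: sqlite3.Connection, label: str,
                 max_depth: int = 10, limit: int = 10_000) -> list[int]:
    """Return the IDs of a place and its descendants, bounded by depth and count."""
    params = {"label": label, "max_depth": max_depth, "limit": limit}
    return [row[0] for row in conn.execute(DESCENDANTS_SQL, params)]


tree_conn = sqlite3.connect(":memory:")
tree_conn.executescript("""
CREATE TABLE places (id INTEGER PRIMARY KEY, label TEXT, broader_id INTEGER);
INSERT INTO places VALUES
    (1, 'Netherlands', NULL),
    (2, 'Amsterdam', 1),
    (3, 'Delft', 1),
    (4, 'Oude Kerk', 2),
    (5, 'France', NULL);
""")
dutch_ids = sorted(expand_place(tree_conn, "Netherlands"))
```

The resulting ID list can then be used in an `IN (...)` filter on the artwork–place mappings, so the text match on "Netherlands" happens only once.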

Resolving historical place names

A collection spanning the 15th to 21st century uses place names from many eras. The World Historical Gazetteer is particularly valuable here — its datasets include Dutch colonial names, VOC-era toponyms, and medieval variants that modern gazetteers do not cover.

Historical name | Modern name | Resolved via | Context
Batavia | Jakarta | WHG | VOC headquarters, 1619–1942
Leyden | Leiden | WHG | Historical English/Latin spelling
's-Gravenhage | Den Haag / The Hague | Wikidata | Formal Dutch name
Cochin | Kochi | WHG | VOC trading post, Kerala
Elmina | Elmina | WHG | WIC fort, Gold Coast (Ghana)
Amstelledamme | Amsterdam | WHG | Medieval Dutch

What geocoding enables at runtime

With coordinates in the database, rijksmuseum-mcp+ supports spatial queries that would be impossible with place names alone.

Proximity search

Find artworks produced in, depicting, or connected to places within a radius of a named location. Uses a custom Haversine distance function in SQLite with a bounding-box pre-filter for performance.

search_artwork(nearPlace: "Oude Kerk Amsterdam", nearPlaceRadius: 0.5)
→ artworks within 500m of the Oude Kerk
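A sketch of the same idea in Python: register a Haversine function with SQLite via `sqlite3`'s `create_function`, and narrow candidates with a cheap bounding box before computing exact distances. The table layout, function names, and sample coordinates are assumptions, and the bounding box ignores wrap-around at the antimeridian and the poles.

```python
import math
import sqlite3

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; returns None if any input is NULL."""
    if None in (lat1, lon1, lat2, lon2):
        return None
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def bounding_box(lat, lon, radius_km):
    """Return (south, north, west, east) of a box that contains the search circle."""
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    dlon = dlat / max(math.cos(math.radians(lat)), 1e-12)
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def near(conn, lat, lon, radius_km):
    """Labels of places within radius_km, using the box as a cheap pre-filter."""
    south, north, west, east = bounding_box(lat, lon, radius_km)
    sql = """
        SELECT label FROM places
        WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?
          AND haversine_km(?, ?, lat, lon) <= ?
        ORDER BY label
    """
    return [r[0] for r in conn.execute(sql, (south, north, west, east, lat, lon, radius_km))]


geo_conn = sqlite3.connect(":memory:")
geo_conn.create_function("haversine_km", 4, haversine_km)
geo_conn.executescript("""
CREATE TABLE places (label TEXT, lat REAL, lon REAL);
INSERT INTO places VALUES
    ('Amsterdam', 52.37, 4.89),
    ('Oude Kerk', 52.3744, 4.8981),
    ('Leiden', 52.16, 4.49);
""")
nearby = near(geo_conn, 52.37, 4.89, 1.0)
```

The `BETWEEN` clauses can use ordinary indexes on `lat` and `lon`, so the comparatively expensive trigonometric function only runs on rows that survive the box filter.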

Multi-word place resolution

Ambiguous queries like "Paleis van Justitie Den Haag" are resolved via progressive token splitting: try the full string first, then drop tokens from the right one at a time, using the dropped tokens as geographic context to disambiguate the name that remains.

"Rapenburg" → 2 candidates
(Leiden: 52.16°N & Amsterdam: 52.37°N)
+ context "Leiden" → picks correct one
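Progressive token splitting can be sketched as a generator of (name, context) pairs plus a lookup that uses the context to narrow candidates. The in-memory `gazetteer` dictionary and the exact-match comparison against a `parent` label are simplifying assumptions; the real resolver presumably queries the vocabulary database.

```python
from typing import Iterator, Optional


def split_candidates(query: str) -> Iterator[tuple[str, str]]:
    """Yield (name, context) pairs, moving tokens from the right into the context."""
    tokens = query.split()
    for cut in range(len(tokens), 0, -1):
        yield " ".join(tokens[:cut]), " ".join(tokens[cut:])


def resolve(query: str, gazetteer: dict[str, list[dict]]) -> Optional[dict]:
    """Return the first unambiguous match, using dropped tokens as context."""
    for name, context in split_candidates(query):
        candidates = gazetteer.get(name.casefold(), [])
        if not candidates:
            continue
        if context:
            narrowed = [c for c in candidates if c["parent"].casefold() == context.casefold()]
            if narrowed:
                return narrowed[0]
        if len(candidates) == 1:
            return candidates[0]
    return None  # no match, or still ambiguous


gazetteer = {
    "rapenburg": [
        {"label": "Rapenburg", "parent": "Leiden", "lat": 52.16},
        {"label": "Rapenburg", "parent": "Amsterdam", "lat": 52.37},
    ],
}
match = resolve("Rapenburg Leiden", gazetteer)
```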

Geographic distance as a signal

Proximity queries double as a data quality tool: outlier distances between an artwork's production place and depicted place can surface authority misidentifications that would otherwise be invisible in text-only metadata.

SK-A-XXXX: produced in London,
depicts "Tewkesbury" → geocoded to Tasmania
→ 16,000 km distance flags the error
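A distance-outlier check along these lines might look as follows. The 5,000 km threshold, the record format, and the sample coordinates (all approximate) are assumptions for illustration; the source does not state what cutoff the pipeline uses.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def flag_distance_outliers(pairs, threshold_km=5000.0):
    """Return (object_id, distance_km) for pairs whose two places are far apart."""
    flagged = []
    for object_id, produced, depicted in pairs:
        distance = haversine_km(*produced, *depicted)
        if distance > threshold_km:
            flagged.append((object_id, round(distance)))
    return flagged


LONDON = (51.51, -0.13)
pairs = [
    ("obj-1", LONDON, (51.99, -2.16)),   # Tewkesbury, Gloucestershire (approx.)
    ("obj-2", LONDON, (-41.2, 145.9)),   # a point in Tasmania (approx.)
]
flagged = flag_distance_outliers(pairs)
```

A large distance is only a hint: plenty of artworks legitimately depict faraway places, so flagged pairs still need a human look before the authority link is changed.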

Configurable radius

Radii from 100 metres to 500 km. Useful for both fine-grained queries ("artworks from this street") and regional surveys ("production in the Rhine valley").

nearPlaceRadius: 0.1 → 100m (a building)
nearPlaceRadius: 25 → 25km (a city region)
nearPlaceRadius: 200 → 200km (a province)

Building coverage: from 0% to 96%

Each phase of the pipeline adds coordinates to a new tranche of places. Authority ID resolution handles the bulk; hierarchy inheritance closes the long tail.

Step | Places | Method
Total places | 32,347 |
Wikidata P625 | 10,264 | SPARQL
Getty TGN | 8,882 | SPARQL
WHG fuzzy + bridge | 2,283 | API
GeoNames | 1,201 | REST API
Subtotal (direct) | 22,630 (70%) |
Hierarchy inheritance | ~8,400 | broader_id
Total geocoded | 31,033 (96%) |
Remaining ungeocoded | 1,314 |

The 1,314 ungeocoded places are primarily orphan vocabulary entries (not linked to any artwork), extremely local landmarks, and a small number of places with authority IDs but no coordinates in any source.
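The coverage figures above can be checked with a few lines of arithmetic. The numbers are taken directly from the breakdown; only the variable names are new.

```python
total_places = 32_347
total_geocoded = 31_033

# Places geocoded directly from an authority source.
direct = {
    "Wikidata P625": 10_264,
    "Getty TGN": 8_882,
    "WHG fuzzy + bridge": 2_283,
    "GeoNames": 1_201,
}

subtotal_direct = sum(direct.values())
inherited = total_geocoded - subtotal_direct   # filled via broader_id
remaining = total_places - total_geocoded      # still without coordinates

direct_pct = round(100 * subtotal_direct / total_places)
geocoded_pct = round(100 * total_geocoded / total_places)
```

The direct subtotal comes to 22,630 (70%), hierarchy inheritance accounts for the remaining 8,403 geocoded places (the "~8,400" above), and 1,314 places stay ungeocoded.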

Validation and edge cases

Geocoding at this scale surfaces errors that are invisible in text-only metadata. Phase 4 runs systematic validation checks; several caught real errors in the source data.

Errors caught by validation

Authority misidentification

Tewkesbury (England) linked to Getty TGN 7821058 — which is Tewkesbury, Tasmania. Invisible without coordinates; detected by distance outlier analysis (16,000 km from co-occurring English places).

Hemisphere and sign errors

Six Caribbean/Dutch locations had latitude and longitude swapped, placing them near the Falkland Islands. Two others had a wrongly negative latitude, which moved Northern Hemisphere locations into the Southern Hemisphere. All eight were caught by systematic hemisphere checks.

Null Island false positives

Places geocoded to 0°N, 0°E (a point in the Gulf of Guinea) indicate a missing-data default, not a real location. Flagged and excluded.

Celestial false positive

A vocabulary entry for "Moon" was geocoded to Moon Township, Pennsylvania. Manually corrected.