The Rijksmuseum records the ownership history of ~48,000 artworks as free-text narratives following the AAM punctuation convention. The provenance parser in rijksmuseum-mcp+ converts these narratives into more than 100,000 structured events, making them searchable by party, transfer type, date, location, and price.
Each provenance narrative passes through a three-layer pipeline. Deterministic parsing handles ~97% of events; an LLM enrichment pass resolves the remainder.
Strip HTML, extract {citations}, split on semicolons, rejoin prices split across commas
22 ordered-choice rules match event-type-specific patterns (sales, gifts, bequests, loans…)
Keyword-priority classifier for segments the grammar can't match
Assign party positions (sender / receiver), transfer category (ownership / custody), ownership periods
Classify residual unknowns, disambiguate merged parties, assign positions — with reasoning
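The first stages above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the rule list, the fallback keywords, and the function names `preprocess` and `classify` are all hypothetical, plain regular expressions stand in for the real grammar, and the step that rejoins prices split across commas is left out.

```python
import html
import re

CITATION = re.compile(r"\{([^{}]*)\}")
TAG = re.compile(r"<[^>]+>")

# Hypothetical ordered rules standing in for the grammar: first match wins.
RULES = [
    ("bequest", re.compile(r"\bbequeathed\b", re.I)),
    ("gift", re.compile(r"\b(?:donated|gift)\b", re.I)),
    ("sale", re.compile(r"\b(?:sale|sold)\b", re.I)),
    ("loan", re.compile(r"\bon loan\b", re.I)),
]

# Hypothetical fallback keywords, highest priority first.
FALLBACK_KEYWORDS = [("purchased", "sale"), ("inherited", "inheritance")]


def preprocess(narrative):
    """Strip HTML, pull out {citations}, and split on semicolons."""
    text = html.unescape(TAG.sub("", narrative))
    segments = []
    # Limitation of this sketch: a semicolon inside a citation would split it.
    for raw in text.split(";"):
        citations = CITATION.findall(raw)
        body = " ".join(CITATION.sub("", raw).split())
        if body:
            segments.append({"text": body, "citations": citations})
    return segments


def classify(segment_text):
    """Return (event_type, parse_method) for one segment."""
    for event_type, pattern in RULES:
        if pattern.search(segment_text):
            return event_type, "peg"
    lowered = segment_text.lower()
    for keyword, event_type in FALLBACK_KEYWORDS:
        if keyword in lowered:
            return event_type, "regex_fallback"
    return "unknown", "regex_fallback"
```

Trying the rules in a fixed order is what makes the choice deterministic: a segment that could match two patterns always gets the type listed first, and only segments that match nothing reach the keyword fallback.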
A provenance narrative for one of the Rijksmuseum's most famous paintings, showing the data elements the parser extracts from each semicolon-delimited event.
Abbreviated for display. Full narrative has 16 events spanning 1624–1908.
The ? prefix marks uncertain attributions; … marks a gap in the documented chain.
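As a rough illustration of how these two markers could be turned into flags, the sketch below reads them off the start of a segment. The function name `read_markers` and the assumption that both markers appear at the very beginning of a segment are hypothetical.

```python
def read_markers(segment):
    """Detect the gap and uncertainty markers at the start of a segment."""
    text = segment.strip()
    # Accept both the ellipsis character and three plain dots as a gap marker.
    gap = text.startswith("…") or text.startswith("...")
    if gap:
        text = text.lstrip("….").strip()
    uncertain = text.startswith("?")
    if uncertain:
        text = text[1:].strip()
    return {"text": text, "uncertain": uncertain, "gap": gap}
```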
Each provenance event is decomposed into structured fields, stored in a SQLite database, and
exposed through the search_provenance tool.
How ownership changed: sale, gift, bequest, inheritance, confiscation, loan, and 13 other types aligned with the CMOA vocabulary.
People and institutions involved, with life dates, roles (buyer, seller, heir, donor…) and inferred position: sender, receiver, or agent.
Event year with qualifier: before, after, circa (±10 years), or exact. Full date expressions preserved.
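One way to picture the qualifier handling is as a mapping from a date expression to a year plus a search range. The sketch below assumes the qualifier sits directly in front of the year; the function name `parse_year`, the abbreviations it accepts, and the `earliest` / `latest` fields are hypothetical. Only the ±10-year window for circa comes from the description above.

```python
import re

YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")
CIRCA = re.compile(r"(?:^|[\s,(])(?:circa|ca\.|c\.)\s*$")
BEFORE = re.compile(r"(?:^|\s)before\s*$")
AFTER = re.compile(r"(?:^|\s)after\s*$")


def parse_year(expression):
    """Extract the first year and its qualifier from a date expression."""
    match = YEAR.search(expression)
    if match is None:
        return None
    year = int(match.group(1))
    # Only the text immediately before the year is inspected for a qualifier.
    prefix = expression[:match.start()].lower()
    if CIRCA.search(prefix):
        qualifier, earliest, latest = "circa", year - 10, year + 10
    elif BEFORE.search(prefix):
        qualifier, earliest, latest = "before", None, year
    elif AFTER.search(prefix):
        qualifier, earliest, latest = "after", year, None
    else:
        qualifier, earliest, latest = "exact", year, year
    return {
        "year": year,
        "qualifier": qualifier,
        "earliest": earliest,
        "latest": latest,
        "raw": expression,  # the full date expression is kept as written
    }
```

Keeping the raw expression next to the parsed year means a day and month such as "16 May 1696" are not lost even though only the year is used for searching.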
Place names matched against 27,237 verified toponyms from the vocabulary database. 64% are geocoded for proximity search.
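Proximity search over geocoded places usually comes down to a great-circle distance. The sketch below uses the standard haversine formula; whether the project computes distance this way is an assumption, and the `PLACES` table holds approximate illustrative coordinates rather than values from the vocabulary database.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

# Approximate coordinates for illustration only (latitude, longitude).
PLACES = {"Delft": (52.01, 4.36), "Amsterdam": (52.37, 4.90)}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within(center, radius_km):
    """Names of all known places within radius_km of the named center."""
    lat, lon = PLACES[center]
    return sorted(
        name
        for name, (plat, plon) in PLACES.items()
        if haversine_km(lat, lon, plat, plon) <= radius_km
    )
```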
Amounts in original historical currency. Ten currencies recognised: guilders, pounds, francs,
marks, euros, dollars, lire, yen, kronen, reis. A batch_price flag marks
en bloc totals for multiple works.
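A price such as "fl. 175" can be split into an amount and a currency with a small pattern. The sketch below is illustrative only: the marker table covers four of the ten currencies, the function name `parse_price` is hypothetical, fractional amounts are ignored, and detecting a batch price from the words "en bloc" is an assumption about how the flag might be set.

```python
import re

# Illustrative marker table; the real parser recognises ten currencies.
SYMBOLS = {"fl.": "guilders", "£": "pounds", "$": "dollars", "€": "euros"}
PRICE = re.compile(r"(fl\.|£|\$|€)\s*(\d+(?:[.,]\d{3})*)")


def parse_price(text):
    """Return amount, original currency, and a batch flag, or None."""
    match = PRICE.search(text)
    if match is None:
        return None
    # Dots and commas followed by three digits are thousands separators.
    amount = int(re.sub(r"[.,]", "", match.group(2)))
    return {
        "amount": amount,
        "currency": SYMBOLS[match.group(1)],  # kept in the original currency
        "batch_price": "en bloc" in text.lower(),
    }
```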
Lot numbers, auctioneer names, and bibliographic references extracted from
{curly braces} and stored separately.
Boolean flags and metadata that support filtering and research workflows:
- uncertain: ? prefix on event or party
- gap: … break in ownership chain
- unsold: auction lot was bought-in (no transfer occurred)
- batch_price: price covers multiple works
- parseMethod: peg, regex_fallback, cross_ref, or credit_line
- categoryMethod / positionMethod: provenance of every enrichment decision

The first five events as structured rows. Each row is one semicolon-delimited segment from the raw text.
| # | Type | Party | Position | Location | Date | Price | Method |
|---|---|---|---|---|---|---|---|
| 1 | collection | Pieter Claesz van Ruijven (1624-1674), Maria Simonsdr de Knuijt (1623-1681) | receiver, receiver | Delft | — | — | peg |
| 2 | by_descent | Magdalena van Ruijven (1655-1682) | receiver | Delft | — | — | peg |
| 3 | widowhood | Jacob Dissius (1653-1695) | receiver | Delft | — | — | peg |
| 4 | sale | Isaac Rooleeuw (1663-1701) | receiver | Amsterdam | 16 May 1696 | fl. 175 | peg |
| … | gap in the documented ownership chain | ||||||
| 5 | collection | Lucretia Johanna van Winter (1785-1845) | receiver | Amsterdam | 1817 | — | peg |
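To show what querying rows like these might look like, the sketch below loads them into an in-memory SQLite table. The table name and column layout are hypothetical, with one row per party, and do not describe the project's actual schema or the parameters of the search_provenance tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flat schema: one row per party per event.
conn.execute(
    """CREATE TABLE events (
           seq INTEGER, type TEXT, party TEXT, position TEXT,
           location TEXT, date TEXT, price TEXT, method TEXT)"""
)
rows = [
    (1, "collection", "Pieter Claesz van Ruijven", "receiver", "Delft", None, None, "peg"),
    (1, "collection", "Maria Simonsdr de Knuijt", "receiver", "Delft", None, None, "peg"),
    (2, "by_descent", "Magdalena van Ruijven", "receiver", "Delft", None, None, "peg"),
    (3, "widowhood", "Jacob Dissius", "receiver", "Delft", None, None, "peg"),
    (4, "sale", "Isaac Rooleeuw", "receiver", "Amsterdam", "16 May 1696", "fl. 175", "peg"),
    (5, "collection", "Lucretia Johanna van Winter", "receiver", "Amsterdam", "1817", None, "peg"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Parties recorded in Amsterdam, in chain order.
amsterdam = conn.execute(
    "SELECT seq, party FROM events WHERE location = ? ORDER BY seq",
    ("Amsterdam",),
).fetchall()

# Events that carry a recorded price.
priced = conn.execute(
    "SELECT seq, type, price FROM events WHERE price IS NOT NULL"
).fetchall()
```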
19 types aligned with the CMOA Art Tracks standard, organised by whether they transfer ownership (legal title) or only custody (temporary possession).
The 21,436 unknown events include 19,871 cross-references ("see provenance of SK-A-XXXX")
that link to companion works and are not true unknowns. After excluding cross-references, only 3 events remain genuinely unresolved.
The deterministic parser (Parsing Expression Grammar + regex fallback + rule-based inference) resolved ~97% of events. For the remaining cases, an LLM (Claude Sonnet) was used in targeted batch passes to resolve three categories of ambiguity that require world knowledge or contextual reasoning.
170 events used institution-specific jargon, archaic terminology, or non-standard phrasing that no keyword rule could reliably match. The LLM classified each with a transfer type and written reasoning. Cost: $0.73.
824 parties had no deterministic role mapping. The LLM inferred sender/receiver/agent from the narrative context and the transfer type of the event. Cost: $9.03.
367 party texts contained multiple people merged into a single string by the parser. The LLM decomposed these into individual parties with correct positions. Cost: $2.31.
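The three reported costs can be added up and broken down per item with a few lines of arithmetic. The pass labels below are shorthand introduced for this sketch; the counts and costs are the figures quoted above.

```python
from decimal import Decimal

# (items handled, reported cost in USD) for each LLM pass.
passes = {
    "type classification": (170, Decimal("0.73")),
    "position assignment": (824, Decimal("9.03")),
    "party disambiguation": (367, Decimal("2.31")),
}

total_cost = sum((cost for _, cost in passes.values()), Decimal("0"))

for name, (count, cost) in passes.items():
    print(f"{name}: {count} items, ${cost}, about ${cost / count:.4f} each")
print(f"total: ${total_cost}")
```

Decimal arithmetic is used so that the cents add up exactly instead of picking up floating-point rounding.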
In the end, a series of grammar improvements, structural signal detection, deterministic rules, and targeted LLM passes reduced the (non-cross-reference) unknowns by 99.9%.