The Rijksmuseum records the ownership history of ~48,000 artworks as free-text narratives following the AAM punctuation convention. The provenance parser in rijksmuseum-mcp+ converts these narratives into over 100,000 structured events, making them searchable by party, transfer type, date, location, and price. Every extracted event and party carries a three-tier method tag (peg / rule / llm) so its source can be audited.
Each provenance narrative passes through a three-layer pipeline. Deterministic parsing handles ~97% of events; an LLM enrichment pass resolves the remainder.
Strip HTML, extract {citations}, split on semicolons, rejoin prices split across commas
22 ordered-choice rules match event-type-specific patterns (sales, gifts, bequests, loans…)
Keyword-priority classifier for segments the grammar can't match
Assign party positions (sender / receiver), transfer category (ownership / custody), ownership periods
Classify residual unknowns, disambiguate merged parties, assign positions — with reasoning
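The first pipeline step can be illustrated with a minimal sketch. It assumes that citations appear in curly braces and that events are separated by semicolons, as described above; the function name `preprocess` and the regular expressions are hypothetical, and the price-rejoining step is left out because its exact rule is not stated here.

```python
import re
from html import unescape
from typing import List, Tuple

TAG_RE = re.compile(r"<[^>]+>")          # naive HTML tag matcher (assumption)
CITATION_RE = re.compile(r"\{([^{}]*)\}")  # {citations} in curly braces


def preprocess(raw: str) -> Tuple[List[str], List[str]]:
    """Return (segments, citations) for one provenance narrative."""
    # Strip HTML tags, then decode entities such as &amp;.
    text = unescape(TAG_RE.sub("", raw))
    # Pull out the bracketed citations and remove them from the text.
    citations = [c.strip() for c in CITATION_RE.findall(text)]
    text = CITATION_RE.sub("", text)
    # Split into event segments on semicolons and normalise whitespace.
    segments = [re.sub(r"\s+", " ", s).strip() for s in text.split(";")]
    return [s for s in segments if s], citations


# Placeholder narrative, not taken from the Rijksmuseum data.
segments, citations = preprocess(
    "<p>collection X, Delft {ref 1}; sale, Amsterdam, fl. 175</p>"
)
```

Each returned segment would then be handed to the grammar, with the citations kept alongside for later storage.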
A provenance narrative for one of the Rijksmuseum's most famous paintings, showing the data elements the parser extracts from each semicolon-delimited event.
Abbreviated for display. Full narrative has 16 events spanning 1624–1908.
The ? prefix marks uncertain attributions; … marks a gap in the documented chain.
Each provenance event is decomposed into structured fields, stored in a SQLite database, and
exposed through the search_provenance tool.
How ownership changed: sale, gift, bequest, inheritance, confiscation, loan, and 13 other types aligned with the CMOA vocabulary.
People and institutions involved, with life dates, roles (buyer, seller, heir, donor…) and inferred position: sender, receiver, or agent.
Event year with qualifier: before, after, circa (±10 years), or exact. Full date expressions preserved.
Place names matched against the vocabulary database's place terms. 81% of known places are geocoded, enabling proximity search and region-based queries.
Amounts in original historical currency. 14 currencies recognised — dominant ones are guilders,
euros, pounds, francs, yen, and deutschmarks; smaller tails for dollars, Swiss francs, guineas,
reichsmarks, livres, marks, napoléons, and Belgian francs. A batch_price flag marks
en bloc totals for multiple works.
Lot numbers, auctioneer names, and bibliographic references extracted from
{curly braces} and stored separately.
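As an illustration of the date field, the sketch below turns a date expression into a year plus qualifier and then into a searchable year range. The names `EventDate`, `parse_date`, and `year_range`, the list of qualifier words, and the treatment of "before" and "after" as open-ended ranges are all assumptions; only the four qualifiers and the ±10-year window for circa come from the description above.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")
# Words directly before the year that change the qualifier (assumed list).
QUALIFIER_WORDS = {
    "before": "before",
    "after": "after",
    "circa": "circa",
    "c.": "circa",
    "ca.": "circa",
}
CIRCA_MARGIN = 10  # circa is treated as +/- 10 years


@dataclass(frozen=True)
class EventDate:
    year: int
    qualifier: str  # "before", "after", "circa", or "exact"
    raw: str        # full date expression, preserved as written


def parse_date(expr: str) -> Optional[EventDate]:
    match = YEAR_RE.search(expr)
    if match is None:
        return None
    # Look only at the word immediately preceding the year.
    words = expr[:match.start()].lower().split()
    qualifier = QUALIFIER_WORDS.get(words[-1], "exact") if words else "exact"
    return EventDate(int(match.group(1)), qualifier, expr)


def year_range(d: EventDate) -> Tuple[Optional[int], Optional[int]]:
    """Inclusive (earliest, latest) year; None means open-ended."""
    if d.qualifier == "before":
        return (None, d.year)
    if d.qualifier == "after":
        return (d.year, None)
    if d.qualifier == "circa":
        return (d.year - CIRCA_MARGIN, d.year + CIRCA_MARGIN)
    return (d.year, d.year)
```

For example, `parse_date("16 May 1696")` yields an exact date in 1696, while `year_range(parse_date("c. 1660"))` gives the range 1650 to 1670.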
Boolean flags and metadata that support filtering and research workflows:
uncertain: ? prefix on event or party
gap: … break in ownership chain
unsold: auction lot was bought-in (no transfer occurred)
batch_price: price covers multiple works
parseMethod: peg, cross_ref, llm_structural, or credit_line
categoryMethod / positionMethod: provenance of every enrichment decision
The first five events as structured rows. Each row is one semicolon-delimited segment from the raw text. Where a segment names two co-owners in parentheses (e.g. “X (dates) and Y (dates)”), the grammar currently captures the first party only, which is a known limitation.
| # | Type | Party | Position | Location | Date | Price | Method |
|---|---|---|---|---|---|---|---|
| 1 | collection | Pieter Claesz van Ruijven (1624-1674) | receiver | Delft | — | — | peg |
| 2 | by_descent | Magdalena van Ruijven (1655-1682) | receiver | Delft | — | — | peg |
| 3 | widowhood | Jacob Dissius (1653-1695) | receiver | Delft | — | — | peg |
| 4 | sale | Isaac Rooleeuw (1663-1701) | receiver | Amsterdam | 16 May 1696 | fl. 175 | peg |
| … | gap in the documented ownership chain | | | | | | |
| 5 | collection | Lucretia Johanna van Winter (1785-1845) | receiver | Amsterdam | 1817 | — | peg |
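
The uncertain and gap flags seen in the rows above could be derived from the raw segment text along these lines. This is a sketch under assumptions: the function name `structural_flags` and the returned dictionary shape are hypothetical, and it only covers a ? at the start of the segment and a segment consisting solely of an ellipsis.

```python
from typing import Dict, Union


def structural_flags(segment: str) -> Dict[str, Union[bool, str]]:
    """Detect the '?' uncertainty prefix and the ellipsis gap marker."""
    text = segment.strip()
    # A segment that is only an ellipsis marks a gap in the chain.
    gap = text in {"…", "..."}
    # A leading '?' marks an uncertain attribution.
    uncertain = text.startswith("?")
    cleaned = text.lstrip("?").strip() if uncertain else text
    return {
        "uncertain": uncertain,
        "gap": gap,
        "text": "" if gap else cleaned,
    }
```

A gap segment carries no event text of its own, so the sketch returns an empty string for it and leaves the row to be rendered as a break between events.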
19 types aligned with the CMOA Art Tracks standard, organised by whether they transfer ownership (legal title) or only custody (temporary possession).
The 19,879 unknown events include 19,872 cross-references ("see provenance of SK-A-XXXX")
that link to companion works and are not true unknowns. After excluding cross-references, only 7 events remain
genuinely unresolved.
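A cross-reference of this kind can be recognised with a simple pattern, and the arithmetic behind the figures above is easy to check. The sketch assumes the phrase "see provenance of" followed by a hyphenated object number such as SK-A-XXXX; the regular expression and the function name `cross_reference` are hypothetical and may not match every phrasing in the data.

```python
import re
from typing import Optional

# Phrase is matched case-insensitively; the object number is assumed to be
# uppercase letters followed by one or more hyphenated parts.
XREF_RE = re.compile(
    r"(?i:\bsee\s+(?:the\s+)?provenance\s+of)\s+([A-Z]+(?:-[A-Z0-9]+)+)"
)


def cross_reference(segment: str) -> Optional[str]:
    """Return the referenced object number, or None if there is none."""
    match = XREF_RE.search(segment)
    return match.group(1) if match else None


# Counts quoted in the text above.
unknown_total = 19_879
cross_refs = 19_872
unresolved = unknown_total - cross_refs  # 7 genuinely unresolved events
```

Events for which `cross_reference` returns an object number can be tagged as links to a companion work rather than counted as unknowns.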
The deterministic parser (Parsing Expression Grammar + regex fallback + rule-based inference) resolved ~97% of events. For the remaining cases, an LLM (Claude Sonnet) was used in targeted batch passes to resolve three categories of ambiguity that require world knowledge or contextual reasoning.
166 events used institution-specific jargon, archaic terminology, or non-standard phrasing that no keyword rule could reliably match. The LLM classified each with a transfer type and written reasoning.
1,276 parties had no deterministic role mapping. The LLM inferred sender/receiver/agent from the narrative context and the transfer type of the event.
439 party records were split out of merged text (e.g. “X, his son Y”). The LLM then assigned each a position by inferring the person's role from the narrative.
In the end, a series of grammar improvements, structural signal detection, deterministic rules, and targeted LLM passes reduced the (non-cross-reference) unknowns by 99.8%.