Provenance Parser

Converts semi-structured, free-text ownership histories into structured data

The Rijksmuseum records the ownership history of ~48,000 artworks as free-text narratives following the AAM punctuation convention. The provenance parser in rijksmuseum-mcp+ parses these narratives into over 100,000 structured events, making them searchable by party, transfer type, date, location, and price. Every extracted event and party carries a three-tier method tag (peg / rule / llm) so its source can be audited.

48,539 artworks with provenance
101,092 events parsed
20 transfer types
14 currencies recognised
7 unresolved events

Processing pipeline

Each provenance narrative passes through a three-layer pipeline. Deterministic parsing handles ~97% of events; an LLM enrichment pass resolves the remainder.

Layer 1a

Preprocessing

Strip HTML, extract {citations}, split on semicolons, rejoin prices split across commas
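
The preprocessing steps can be sketched as follows. This is a minimal illustration, not the project's code: the function names, the regular expressions, and the thousands-separator heuristic are all assumptions made for the example.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")            # crude HTML tag stripper (assumption)
CITATION_RE = re.compile(r"\{([^{}]*)\}")  # {curly-brace} citations
# 1-3 digits at the end of a fragment, not preceded by another digit,
# so a four-digit year such as "1908" is not mistaken for the head of a price.
PRICE_HEAD_RE = re.compile(r"(?<!\d)\d{1,3}$")
# Exactly three digits at the start of the next fragment ("000 with ...").
PRICE_TAIL_RE = re.compile(r"^\d{3}(?!\d)")


def preprocess(raw):
    """Return (segments, citations) for one raw provenance narrative."""
    text = TAG_RE.sub("", raw)
    # Pull citations out first so their commas never reach the field splitter.
    citations = [c.strip() for c in CITATION_RE.findall(text)]
    text = CITATION_RE.sub("", text)
    segments = [s.strip() for s in text.split(";") if s.strip()]
    return segments, citations


def split_fields(segment):
    """Split a segment on commas, rejoining prices such as 'fl. 550,000'."""
    fields = []
    for part in (p.strip() for p in segment.split(",")):
        if fields and PRICE_HEAD_RE.search(fields[-1]) and PRICE_TAIL_RE.match(part):
            fields[-1] = f"{fields[-1]},{part}"
        else:
            fields.append(part)
    return fields
```

Extracting citations before splitting keeps the commas inside {curly braces} out of the field splitter.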

Layer 1b

PEG Grammar

22 ordered-choice rules match event-type-specific patterns (sales, gifts, bequests, loans…)
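
Ordered choice means the alternatives are tried in a fixed order and the first one that matches wins. The sketch below shows the idea with a few hand-rolled parser combinators. The rule patterns and their order are illustrative assumptions and do not reproduce the parser's 22 rules.

```python
import re
from typing import Callable, Optional, Tuple

# A parser takes (text, pos) and returns (value, new_pos), or None on failure.
Parser = Callable[[str, int], Optional[Tuple[object, int]]]


def regex(pattern: str) -> Parser:
    compiled = re.compile(pattern)

    def parse(text, pos):
        m = compiled.match(text, pos)
        return (m.group(0), m.end()) if m else None

    return parse


def seq(*parsers: Parser) -> Parser:
    """Sequence: every parser must match, one after the other."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos

    return parse


def choice(*alternatives) -> Parser:
    """Ordered choice over (label, parser) pairs: the first match wins."""
    def parse(text, pos):
        for label, p in alternatives:
            result = p(text, pos)
            if result is not None:
                return (label, result[0]), result[1]
        return None

    return parse


POSSESSIVE = regex(r"(?:his|her|their)\s+")

# Hypothetical rules: specific patterns first, the bare-name catch-all last.
EVENT = choice(
    ("widowhood", seq(POSSESSIVE, regex(r"widow(?:er)?\b"))),
    ("sale", seq(POSSESSIVE, regex(r"sale\b"))),
    ("by_descent", seq(POSSESSIVE, regex(r"(?:son|daughter)\b"))),
    ("gift", regex(r"donated\s+to\b")),
    ("collection", regex(r"[A-Z]")),
)


def classify(segment: str) -> Optional[str]:
    result = EVENT(segment.strip(), 0)
    return result[0][0] if result else None
```

Because the catch-all rule sits last, a segment such as "his sale, Amsterdam" is claimed by the sale rule before the bare-name rule is ever tried.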

Layer 1c

Regex Fallback

Keyword-priority classifier for segments the grammar can't match
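
A keyword-priority classifier can be as simple as an ordered list of keyword groups, scanned top to bottom. In the sketch below the keyword lists and their priority order are assumptions chosen for illustration. Only the transfer type names come from this page.

```python
import re

# Hypothetical priority order: earlier entries win when several keywords occur.
KEYWORD_PRIORITY = [
    ("bequest", ("bequeathed", "bequest")),
    ("gift", ("donated", "gift")),
    ("loan", ("on loan", "lent")),
    ("sale", ("sale", "sold", "purchased")),
]


def classify_fallback(segment: str) -> str:
    """Return the first transfer type whose keyword occurs as a whole word."""
    lowered = segment.lower()
    for transfer_type, keywords in KEYWORD_PRIORITY:
        for keyword in keywords:
            # Word boundaries stop "sale" from matching inside "wholesale".
            if re.search(rf"\b{re.escape(keyword)}\b", lowered):
                return transfer_type
    return "unknown"
```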

Layer 2

Post-processing

Assign party positions (sender / receiver), transfer category (ownership / custody), ownership periods
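
As a rough sketch of this layer, the snippet below maps transfer types to a category using the vocabulary listed on this page and derives ownership periods from consecutive ownership events. The `Event` class, the `gap_before` flag, and the period logic are assumptions made for the example. The owners and years in the usage note are placeholders, not museum data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

OWNERSHIP_TYPES = {
    "sale", "collection", "gift", "by_descent", "bequest", "widowhood",
    "commission", "recuperation", "restitution", "inventory", "inheritance",
    "confiscation", "looting", "exchange", "theft",
}
CUSTODY_TYPES = {"loan", "deposit"}


def transfer_category(transfer_type: str) -> Optional[str]:
    if transfer_type in OWNERSHIP_TYPES:
        return "ownership"
    if transfer_type in CUSTODY_TYPES:
        return "custody"
    return None  # transfer, non_provenance, unknown


@dataclass
class Event:
    transfer_type: str
    receiver: str
    year: Optional[int] = None
    gap_before: bool = False  # hypothetical flag: chain is broken before this event


def ownership_periods(events: List[Event]) -> List[Tuple[str, Optional[int], Optional[int]]]:
    """Return (owner, start_year, end_year) for each ownership event.

    An owner's period ends when the next ownership event begins, unless a
    gap precedes that event, in which case the end year is left open.
    """
    owners = [e for e in events if transfer_category(e.transfer_type) == "ownership"]
    periods = []
    for i, current in enumerate(owners):
        following = owners[i + 1] if i + 1 < len(owners) else None
        if following is None or following.gap_before:
            end = None
        else:
            end = following.year
        periods.append((current.receiver, current.year, end))
    return periods
```

With the placeholder chain `Event("sale", "Owner A", 1696)`, `Event("loan", "Museum X", 1700)`, `Event("by_descent", "Owner B", 1720)`, the loan is skipped because it only transfers custody, and Owner A's period closes in 1720.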

Layer 3

LLM Enrichment

Classify residual unknowns, disambiguate merged parties, assign positions — with reasoning

Parse method distribution (101,092 events): Parsing Expression Grammar (PEG) 80%, cross-ref 20%, LLM (no share given).

Example: The Milkmaid (SK-A-2344)

A provenance narrative for one of the Rijksmuseum's most famous paintings, showing the data elements the parser extracts from each semicolon-delimited event.

Legend: Party, Role, Transfer type, Date, Location, Price, Lot / details, Citation, Gap
? party Pieter Claesz van Ruijven (1624-1674) and party Maria Simonsdr de Knuijt (1623-1681), loc Delft ;
? role their daughter, party Magdalena van Ruijven (1655-1682), loc Delft ;
? role her widower, party Jacob Dissius (1653-1695), loc Delft ;
cite {Montias 1989, pp. 246-262, 359, doc. 417.}
role his type sale, loc Amsterdam, date 16 May 1696, lot no. 2, price fl. 175, to party Isaac Rooleeuw (1663-1701), loc Amsterdam ;
gap
party Lucretia Johanna van Winter (1785-1845), loc Amsterdam, date 1817 ;
role from whom type purchased by the party Rijksmuseum, loc Amsterdam, date 1908, price fl. 550,000 with batch 38 other paintings

Abbreviated for display. The full narrative has 16 events spanning 1624–1908. The ? prefix marks uncertain attributions; the gap marker indicates a break in the documented chain.

What the parser extracts

Each provenance event is decomposed into structured fields, stored in a SQLite database, and exposed through the search_provenance tool.
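
To make the storage model concrete, here is a small sketch using Python's built-in `sqlite3` module. The table name, column names, and the `find_events` helper are hypothetical. They do not describe the project's actual schema or the interface of the search_provenance tool. The rows are the five Milkmaid events listed further down this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE provenance_event (   -- hypothetical schema
        artwork_id    TEXT,
        seq           INTEGER,
        transfer_type TEXT,
        party         TEXT,
        position      TEXT,
        location      TEXT,
        year          INTEGER,
        price_amount  INTEGER,
        currency      TEXT,
        parse_method  TEXT
    )
    """
)
rows = [
    ("SK-A-2344", 1, "collection", "Pieter Claesz van Ruijven", "receiver", "Delft", None, None, None, "peg"),
    ("SK-A-2344", 2, "by_descent", "Magdalena van Ruijven", "receiver", "Delft", None, None, None, "peg"),
    ("SK-A-2344", 3, "widowhood", "Jacob Dissius", "receiver", "Delft", None, None, None, "peg"),
    ("SK-A-2344", 4, "sale", "Isaac Rooleeuw", "receiver", "Amsterdam", 1696, 175, "guilders", "peg"),
    ("SK-A-2344", 5, "collection", "Lucretia Johanna van Winter", "receiver", "Amsterdam", 1817, None, None, "peg"),
]
conn.executemany("INSERT INTO provenance_event VALUES (?,?,?,?,?,?,?,?,?,?)", rows)


def find_events(conn, transfer_type=None, location=None):
    """Filter events by optional transfer type and location (parameterised SQL)."""
    sql = "SELECT seq, party, year FROM provenance_event WHERE 1=1"
    params = []
    if transfer_type is not None:
        sql += " AND transfer_type = ?"
        params.append(transfer_type)
    if location is not None:
        sql += " AND location = ?"
        params.append(location)
    sql += " ORDER BY seq"
    return conn.execute(sql, params).fetchall()
```

Calling `find_events(conn, transfer_type="sale")` returns the 1696 sale to Isaac Rooleeuw, and `find_events(conn, location="Delft")` returns the three Delft events.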

Transfer type

How ownership changed: sale, gift, bequest, inheritance, confiscation, loan, and 13 other types aligned with the CMOA vocabulary.

"his sale, Amsterdam" → sale
"donated to the museum" → gift
"her widower, Jacob…" → widowhood

Parties & roles

People and institutions involved, with life dates, roles (buyer, seller, heir, donor…) and inferred position: sender, receiver, or agent.

"to Isaac Rooleeuw (1663-1701)"
→ name: Isaac Rooleeuw
→ dates: 1663–1701, role: buyer
→ position: receiver
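
The decomposition above could be approximated with one regular expression. This is only a sketch: the pattern, the prefix-to-position table, and the sale-specific role names are assumptions, and real narratives contain many more forms than this handles.

```python
import re

PARTY_RE = re.compile(
    r"^(?:(?P<prefix>to|from)\s+)?"      # optional direction word
    r"(?P<name>[^()]+?)\s*"              # name, up to any parenthesis
    r"(?:\((?P<birth>\d{4})\s*[-–]\s*(?P<death>\d{4})\))?$"  # optional life dates
)
PREFIX_POSITION = {"to": "receiver", "from": "sender"}   # assumed mapping
SALE_ROLES = {"receiver": "buyer", "sender": "seller"}   # assumed mapping


def parse_party(text, transfer_type):
    """Parse 'to Name (birth-death)' into name, life dates, role and position."""
    m = PARTY_RE.match(text.strip())
    if m is None:
        return None
    position = PREFIX_POSITION.get(m.group("prefix"))
    role = SALE_ROLES.get(position) if transfer_type == "sale" else None
    return {
        "name": m.group("name"),
        "birth": int(m.group("birth")) if m.group("birth") else None,
        "death": int(m.group("death")) if m.group("death") else None,
        "role": role,
        "position": position,
    }
```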

Dates & qualifiers

Event year with qualifier: before, after, circa (±10 years), or exact. Full date expressions preserved.

"16 May 1696" → year: 1696, qualifier: null
"c. 1817" → year: 1817, qualifier: circa
"before 1674" → year: 1674, qualifier: before

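The three date examples above can be reproduced with a short parser. The regular expression and the returned field names are assumptions for illustration. Following the examples, an unqualified date yields a qualifier of `None`.

```python
import re

DATE_RE = re.compile(
    r"^(?:(?P<qualifier>before|after|circa|c\.)\s+)?"
    r"(?P<expression>.*?(?P<year>\d{4}))$",
    re.IGNORECASE,
)


def parse_date(text):
    """Extract the year and qualifier while keeping the date expression."""
    m = DATE_RE.match(text.strip())
    if m is None:
        return None
    qualifier = m.group("qualifier")
    if qualifier is not None:
        qualifier = qualifier.lower()
        if qualifier == "c.":
            qualifier = "circa"
    return {
        "year": int(m.group("year")),
        "qualifier": qualifier,
        "expression": m.group("expression"),
    }
```
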
Locations

Place names matched against the vocabulary database's place terms. 81% of known places are geocoded, enabling proximity search and region-based queries.

"Amsterdam" → loc: Amsterdam (52.37°N, 4.89°E)
"Delft" → loc: Delft (52.01°N, 4.36°E)
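
Once places carry coordinates, a proximity search reduces to a great-circle distance check. The sketch below uses the haversine formula with the two coordinates shown above. The `PLACES` table and the `places_within` helper are hypothetical and stand in for whatever the vocabulary database actually provides.

```python
import math

# Coordinates as shown in the examples (degrees north, degrees east).
PLACES = {
    "Amsterdam": (52.37, 4.89),
    "Delft": (52.01, 4.36),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def places_within(origin, radius_km):
    """Names of known places within radius_km of origin, excluding origin."""
    lat, lon = PLACES[origin]
    return sorted(
        name
        for name, (plat, plon) in PLACES.items()
        if name != origin and haversine_km(lat, lon, plat, plon) <= radius_km
    )
```

With these coordinates Delft lies a little over 50 km from Amsterdam, so it is returned for a 60 km radius but not for a 30 km one.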

Prices & currencies

Amounts in original historical currency. 14 currencies recognised — dominant ones are guilders, euros, pounds, francs, yen, and deutschmarks; smaller tails for dollars, Swiss francs, guineas, reichsmarks, livres, marks, napoléons, and Belgian francs. A batch_price flag marks en bloc totals for multiple works.

"fl. 175" → 175 guilders
"fl. 550,000 with 38 other paintings"
→ 550,000 guilders, batch_price: true
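
A minimal sketch of the price extraction, covering only the "fl." (guilders) prefix that appears in these examples. The other 13 currencies are omitted, and the regular expressions, including the "with N other" test for batch prices, are assumptions.

```python
import re

# Either a comma-grouped amount (550,000) or a plain run of digits (175).
PRICE_RE = re.compile(r"fl\.\s*(?P<amount>\d{1,3}(?:,\d{3})+|\d+)")
BATCH_RE = re.compile(r"\bwith\s+\d+\s+other\b", re.IGNORECASE)


def parse_price(text):
    """Return amount, currency and batch flag, or None if no price is found."""
    m = PRICE_RE.search(text)
    if m is None:
        return None
    return {
        "amount": int(m.group("amount").replace(",", "")),
        "currency": "guilders",
        "batch_price": BATCH_RE.search(text) is not None,
    }
```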

Sale details & citations

Lot numbers, auctioneer names, and bibliographic references extracted from {curly braces} and stored separately.

"no. 2" → lot number: 2
"{Montias 1989, pp. 246-262}"
→ citation (linked to event)

Provenance signals

Boolean flags and metadata that support filtering and research workflows:

uncertain: ? prefix on event or party
gap: break in ownership chain
unsold: auction lot was bought-in (no transfer occurred)
batch_price: price covers multiple works
parseMethod: peg, cross_ref, llm_structural, or credit_line
categoryMethod / positionMethod: provenance of every enrichment decision

Parsed output for the Milkmaid (excerpt)

The first five events as structured rows. Each row is one semicolon-delimited segment from the raw text. Where a segment names two co-owners, each with life dates in parentheses (e.g. “X (dates) and Y (dates)”), the grammar currently captures only the first party. This is a known limitation.

# | Type | Party | Position | Location | Date | Price | Method
1 | collection | Pieter Claesz van Ruijven (1624-1674) | receiver | Delft | | | peg
2 | by_descent | Magdalena van Ruijven (1655-1682) | receiver | Delft | | | peg
3 | widowhood | Jacob Dissius (1653-1695) | receiver | Delft | | | peg
4 | sale | Isaac Rooleeuw (1663-1701) | receiver | Amsterdam | 16 May 1696 | fl. 175 | peg
(gap in the documented ownership chain)
5 | collection | Lucretia Johanna van Winter (1785-1845) | receiver | Amsterdam | 1817 | | peg

Transfer type vocabulary

19 types aligned with the CMOA Art Tracks standard, organised by whether they transfer ownership (legal title) or only custody (temporary possession).

Ownership transfers

sale: 15,657
collection: 18,561
gift: 10,897
by_descent: 13,663
bequest: 4,378
widowhood: 3,455
commission: 580
recuperation: 853
restitution: 86
inventory: 122
inheritance: 41
confiscation: 19
looting: 5
exchange: 5
theft: 4

Custody transfers

loan: 6,301
deposit: 244

Other / unresolved

transfer (administrative): 6,275
non_provenance (Step 7 reclassification): 67
unknown (incl. cross-refs): 19,879

The 19,879 unknown events include 19,872 cross-references ("see provenance of SK-A-XXXX") that link to companion works and are not true unknowns. After excluding cross-references, only 7 events remain genuinely unresolved.
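
The figures in this section can be cross-checked with a few lines of arithmetic. The counts below are copied from the lists above. The variable names are only for illustration.

```python
ownership = {
    "sale": 15_657, "collection": 18_561, "gift": 10_897, "by_descent": 13_663,
    "bequest": 4_378, "widowhood": 3_455, "commission": 580, "recuperation": 853,
    "restitution": 86, "inventory": 122, "inheritance": 41, "confiscation": 19,
    "looting": 5, "exchange": 5, "theft": 4,
}
custody = {"loan": 6_301, "deposit": 244}
other = {"transfer": 6_275, "non_provenance": 67, "unknown": 19_879}

total_events = sum(ownership.values()) + sum(custody.values()) + sum(other.values())
cross_references = 19_872
unresolved = other["unknown"] - cross_references

print(total_events)  # 101092, matching the number of events parsed
print(unresolved)    # 7 genuinely unresolved events
```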

Where the parser needed help from an LLM

The deterministic parser (Parsing Expression Grammar + regex fallback + rule-based inference) resolved ~97% of events. For the remaining cases, an LLM (Claude Sonnet) was used in targeted batch passes to resolve three categories of ambiguity that require world knowledge or contextual reasoning.

Three problems that required an LLM

1. Transfer type classification

166 events used institution-specific jargon, archaic terminology, or non-standard phrasing that no keyword rule could reliably match. The LLM classified each with a transfer type and written reasoning.

"placed at the disposal of the SNK"
→ LLM: deposit
Reasoning: "SNK administered wartime deposits; this is custody, not ownership"

2. Party position disambiguation

1,276 parties had no deterministic role mapping. The LLM inferred sender/receiver/agent from the narrative context and the transfer type of the event.

"with dealer [Name], to the museum"
→ LLM: [Name] = sender
Reasoning: "dealer is the intermediary selling to the museum (receiver)"

3. Merged party decomposition

439 party records were split out of merged text (e.g. “X, his son Y”). The LLM then assigned each resulting party a position, inferring each person's role from the narrative.

"Jan de Vos, his son Pieter de Vos"
→ LLM: split into 2 parties: Jan de Vos (sender), Pieter de Vos (receiver)

From 3,204 unknowns to 7

A series of grammar improvements, structural signal detection, deterministic rules, and targeted LLM passes reduced the non-cross-reference unknowns from 3,204 to 7, a 99.8% reduction.

Starting unknowns: 3,204
Grammar improvements (PEG rules): −1,890
Structural signals (regex): −832
Unsold reclassification (rule): −230
Credit-line events (rule): −79
Transfer type classification (LLM): −166
Remaining: 7
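
The reduction can be verified step by step. The numbers are taken from the breakdown above. The list structure is just a convenient way to lay out the arithmetic.

```python
starting_unknowns = 3_204
steps = [
    ("Grammar improvements (PEG rules)", 1_890),
    ("Structural signals (regex)", 832),
    ("Unsold reclassification (rule)", 230),
    ("Credit-line events (rule)", 79),
    ("Transfer type classification (LLM)", 166),
]

remaining = starting_unknowns
for label, resolved in steps:
    remaining -= resolved
    print(f"{label}: -{resolved} -> {remaining} left")

reduction_pct = round(100 * (starting_unknowns - remaining) / starting_unknowns, 1)
print(remaining, reduction_pct)  # 7 remaining, a 99.8% reduction
```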