Provenance Parser

Turning free-text ownership histories into structured data

The Rijksmuseum records the ownership history of ~48,000 artworks as free-text narratives following the AAM punctuation convention. The provenance parser in rijksmuseum-mcp+ parses these narratives into over 100,000 structured events, making them searchable by party, transfer type, date, location, and price.

48,313 artworks with provenance
100,727 events parsed
19 transfer types
10 currencies recognised
3 unresolved events

Processing pipeline

Each provenance narrative passes through a three-layer pipeline. Deterministic parsing handles ~97% of events; an LLM enrichment pass resolves the remainder.

Layer 1a: Preprocessing
Strip HTML, extract {citations}, split on semicolons, rejoin prices split across commas.

Layer 1b: PEG Grammar
22 ordered-choice rules match event-type-specific patterns (sales, gifts, bequests, loans…).

Layer 1c: Regex Fallback
Keyword-priority classifier for segments the grammar can't match.

Layer 2: Post-processing
Assign party positions (sender / receiver), transfer category (ownership / custody), and ownership periods.

Layer 3: LLM Enrichment
Classify residual unknowns, disambiguate merged parties, and assign positions, each with written reasoning.
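The preprocessing stage can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function name, the regular expressions, and the return shape are assumptions, and the price-rejoining step is left out.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")          # crude HTML tag stripper (assumption)
CITATION_RE = re.compile(r"\{([^{}]*)\}")  # bibliographic references in {curly braces}


def preprocess(narrative: str):
    """Return (segments, citations) for one raw provenance narrative."""
    # 1. Strip HTML tags and decode entities.
    text = unescape(TAG_RE.sub("", narrative))
    # 2. Pull out {citations} and remove them from the running text.
    citations = [c.strip() for c in CITATION_RE.findall(text)]
    text = CITATION_RE.sub("", text)
    # 3. Split on semicolons; each non-empty piece is one candidate event.
    segments = [s.strip() for s in text.split(";")]
    return [s for s in segments if s], citations


raw = ("<p>? her widower, Jacob Dissius (1653-1695), Delft; "
       "{Montias 1989, doc. 417.} his sale, Amsterdam, 16 May 1696, no. 2, fl. 175</p>")
segments, citations = preprocess(raw)
```

In this sketch the citations are returned as a flat list; linking each citation back to the event it documents would need the segment index to be tracked as well.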

Parse method distribution (100,727 events)
Parsing Expression Grammar (PEG)  58%
Regex  19%
Cross-ref  20%
LLM

Example: The Milkmaid (SK-A-2344)

A provenance narrative for one of the Rijksmuseum's most famous paintings, showing the data elements the parser extracts from each semicolon-delimited event.

Legend: Party, Role, Transfer type, Date, Location, Price, Lot / details, Citation, Gap
? party Pieter Claesz van Ruijven (1624-1674) and party Maria Simonsdr de Knuijt (1623-1681), loc Delft ;
? role their daughter, party Magdalena van Ruijven (1655-1682), loc Delft ;
? role her widower, party Jacob Dissius (1653-1695), loc Delft ;
cite {Montias 1989, pp. 246-262, 359, doc. 417.}
role his type sale, loc Amsterdam, date 16 May 1696, lot no. 2, price fl. 175, to party Isaac Rooleeuw (1663-1701), loc Amsterdam ;
gap
party Lucretia Johanna van Winter (1785-1845), loc Amsterdam, date 1817 ;
role from whom type purchased by the party Rijksmuseum, loc Amsterdam, date 1908, price fl. 550,000 with batch 38 other paintings

Abbreviated for display. The full narrative has 16 events spanning 1624–1908. The ? prefix marks uncertain attributions; the gap marker indicates a break in the documented chain.

What the parser extracts

Each provenance event is decomposed into structured fields, stored in a SQLite database, and exposed through the search_provenance tool.

Transfer type

How ownership changed: sale, gift, bequest, inheritance, confiscation, loan, and 13 other types aligned with the CMOA vocabulary.

"his sale, Amsterdam" → sale
"donated to the museum" → gift
"her widower, Jacob…" → widowhood

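A keyword-priority classifier of the kind used as a fallback could look like the sketch below. The keyword table and its ordering are assumptions made for illustration; only the three example mappings above come from the parser's documented behaviour.

```python
import re

# Hypothetical keyword table, checked top to bottom; the first hit wins.
KEYWORD_PRIORITY = [
    ("widowhood", re.compile(r"\b(widow|widower)\b", re.IGNORECASE)),
    ("bequest", re.compile(r"\b(bequeathed|bequest)\b", re.IGNORECASE)),
    ("gift", re.compile(r"\b(donated|gift|presented)\b", re.IGNORECASE)),
    ("sale", re.compile(r"\b(sale|sold|purchased)\b", re.IGNORECASE)),
    ("loan", re.compile(r"\b(on loan|lent)\b", re.IGNORECASE)),
]


def classify_transfer(segment: str) -> str:
    """Return the first transfer type whose keywords occur in the segment."""
    for transfer_type, pattern in KEYWORD_PRIORITY:
        if pattern.search(segment):
            return transfer_type
    # Nothing matched: leave the event for later passes.
    return "unknown"
```

Ordering matters in a table like this: a segment that mentions both a widow and a sale is resolved by whichever rule comes first, which is why a priority list is easier to reason about than an unordered set of keywords.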
Parties & roles

People and institutions involved, with life dates, roles (buyer, seller, heir, donor…) and inferred position: sender, receiver, or agent.

"to Isaac Rooleeuw (1663-1701)"
→ name: Isaac Rooleeuw
→ dates: 1663–1701, role: buyer
→ position: receiver
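As a rough sketch of this decomposition, the snippet below splits a party string into a name and life dates and derives a position from a role. The `Party` class, the `parse_party` helper, and the role-to-position table are hypothetical; only the buyer-to-receiver pairing is taken from the example above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# "Name (birth-death)" with a hyphen or en dash between the years.
LIFE_DATES_RE = re.compile(
    r"^(?P<name>.+?)\s*\((?P<birth>\d{4})\s*[-–]\s*(?P<death>\d{4})\)$"
)

# Hypothetical role-to-position table (assumption).
ROLE_TO_POSITION = {
    "buyer": "receiver",
    "heir": "receiver",
    "seller": "sender",
    "donor": "sender",
}


@dataclass
class Party:
    name: str
    birth: Optional[int] = None
    death: Optional[int] = None
    role: Optional[str] = None
    position: Optional[str] = None


def parse_party(text: str, role: Optional[str] = None) -> Party:
    """Split 'Name (yyyy-yyyy)' into fields and infer a position from the role."""
    text = text.strip()
    position = ROLE_TO_POSITION.get(role) if role else None
    match = LIFE_DATES_RE.match(text)
    if match is None:
        # No life dates found: keep the whole string as the name.
        return Party(name=text, role=role, position=position)
    return Party(match["name"], int(match["birth"]), int(match["death"]), role, position)


rooleeuw = parse_party("Isaac Rooleeuw (1663-1701)", role="buyer")
```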

Dates & qualifiers

Event year with qualifier: before, after, circa (±10 years), or exact. Full date expressions preserved.

"16 May 1696" → year: 1696, qualifier: null
"c. 1817" → year: 1817, qualifier: circa
"before 1674" → year: 1674, qualifier: before

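The qualifier logic can be illustrated with a small helper. This is a sketch under assumptions: the function name, the accepted abbreviations, and the dictionary output are illustrative, and an exact date is represented with a qualifier of `None`, mirroring the `null` in the example above.

```python
import re
from typing import Optional

# Leading words that qualify a year; the abbreviation list is an assumption.
QUALIFIERS = {
    "before": "before",
    "after": "after",
    "circa": "circa",
    "ca.": "circa",
    "c.": "circa",
}
QUALIFIER_RE = re.compile(r"(before|after|circa|ca\.|c\.)\s+", re.IGNORECASE)
YEAR_RE = re.compile(r"\b(\d{4})\b")


def parse_date(expression: str) -> Optional[dict]:
    """Extract the event year and qualifier, keeping the full expression."""
    expression = expression.strip()
    years = YEAR_RE.findall(expression)
    if not years:
        return None
    prefix = QUALIFIER_RE.match(expression)
    qualifier = QUALIFIERS[prefix.group(1).lower()] if prefix else None
    # The last four-digit number is taken as the year ("16 May 1696" -> 1696).
    return {"year": int(years[-1]), "qualifier": qualifier, "expression": expression}
```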
Locations

Place names matched against 27,237 verified toponyms from the vocabulary database. 64% are geocoded for proximity search.

"Amsterdam" → loc: Amsterdam (52.37°N, 4.89°E)
"Delft" → loc: Delft (52.01°N, 4.36°E)
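To show why geocoding matters for proximity search, here is a minimal sketch using the two coordinates quoted above. The in-memory `GAZETTEER` dictionary and the helper names are assumptions; the real lookup runs against the vocabulary database.

```python
import math
from typing import Optional, Tuple

# Two entries copied from the examples above (latitude, longitude in degrees).
GAZETTEER = {"Amsterdam": (52.37, 4.89), "Delft": (52.01, 4.36)}

EARTH_RADIUS_KM = 6371.0


def geocode(place: str) -> Optional[Tuple[float, float]]:
    """Return coordinates for a known toponym, or None if it is not geocoded."""
    return GAZETTEER.get(place.strip())


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within_radius(place: str, centre: str, radius_km: float) -> bool:
    """True when both places are geocoded and lie within radius_km of each other."""
    p, c = geocode(place), geocode(centre)
    if p is None or c is None:
        return False
    return haversine_km(p, c) <= radius_km
```

Places that match a toponym but have no coordinates simply drop out of radius queries in this sketch, which is the practical consequence of the 64% geocoding rate.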

Prices & currencies

Amounts in original historical currency. Ten currencies recognised: guilders, pounds, francs, marks, euros, dollars, lire, yen, kronen, reis. A batch_price flag marks en bloc totals for multiple works.

"fl. 175" → 175 guilders
"fl. 550,000 with 38 other paintings"
→ 550,000 guilders, batch_price: true
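A price extractor along these lines might be sketched as follows. The regular expressions, the `parse_price` name, and the batch-detection phrase are assumptions; of the currency prefixes only "fl." appears in the examples above, and the "£" entry is purely illustrative.

```python
import re
from typing import Optional

# Prefix -> currency name. Only "fl." comes from the examples; "£" is an assumption.
CURRENCY_PREFIXES = {"fl.": "guilders", "£": "pounds"}

# Amount with optional thousands separators, e.g. 175 or 550,000.
PRICE_RE = re.compile(r"(?P<prefix>fl\.|£)\s*(?P<amount>\d+(?:,\d{3})*)")
# Phrase signalling an en bloc total (assumed wording).
BATCH_RE = re.compile(r"\bwith\s+\d+\s+other\b", re.IGNORECASE)


def parse_price(text: str) -> Optional[dict]:
    """Return amount, currency and batch flag, or None when no price is found."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return {
        "amount": int(match["amount"].replace(",", "")),
        "currency": CURRENCY_PREFIXES[match["prefix"]],
        "batch_price": bool(BATCH_RE.search(text)),
    }
```

Requiring exactly three digits after each comma keeps the amount from swallowing the next comma-separated field, so "fl. 175, to Isaac Rooleeuw" still yields 175.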

Sale details & citations

Lot numbers, auctioneer names, and bibliographic references extracted from {curly braces} and stored separately.

"no. 2" → lot number: 2
"{Montias 1989, pp. 246-262}"
→ citation (linked to event)

Provenance signals

Boolean flags and metadata that support filtering and research workflows:

uncertain: ? prefix on event or party
gap: break in ownership chain
unsold: auction lot was bought-in (no transfer occurred)
batch_price: price covers multiple works
parseMethod: peg, regex_fallback, cross_ref, or credit_line
categoryMethod / positionMethod: provenance of every enrichment decision
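Two of these flags can be read directly off the segment text, as the sketch below suggests. The `Signals` class, the `read_signals` helper, and the wording matched for unsold lots are assumptions, not the parser's actual rules.

```python
import re
from dataclasses import dataclass

# Assumed wording for lots that failed to sell.
UNSOLD_RE = re.compile(r"\b(bought in|unsold)\b", re.IGNORECASE)


@dataclass
class Signals:
    text: str        # segment with the leading "?" removed
    uncertain: bool  # segment started with "?"
    unsold: bool     # auction lot did not change hands


def read_signals(segment: str) -> Signals:
    """Detect the uncertainty prefix and an unsold lot in one segment."""
    stripped = segment.strip()
    uncertain = stripped.startswith("?")
    text = stripped.lstrip("?").strip()
    return Signals(text=text, uncertain=uncertain, unsold=bool(UNSOLD_RE.search(text)))


def real_transfers(segments):
    """Keep only segments where a transfer actually occurred (not unsold)."""
    return [s for s in map(read_signals, segments) if not s.unsold]
```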

Parsed output for the Milkmaid (excerpt)

The first five events as structured rows. Each row is one semicolon-delimited segment from the raw text.

# | Type       | Party                                   | Position | Location  | Date        | Price   | Method
1 | collection | Pieter Claesz van Ruijven (1624-1674)   | receiver | Delft     |             |         | peg
  |            | Maria Simonsdr de Knuijt (1623-1681)    | receiver |           |             |         |
2 | by_descent | Magdalena van Ruijven (1655-1682)       | receiver | Delft     |             |         | peg
3 | widowhood  | Jacob Dissius (1653-1695)               | receiver | Delft     |             |         | peg
4 | sale       | Isaac Rooleeuw (1663-1701)              | receiver | Amsterdam | 16 May 1696 | fl. 175 | peg
  | (gap in the documented ownership chain)
5 | collection | Lucretia Johanna van Winter (1785-1845) | receiver | Amsterdam | 1817        |         | peg

Transfer type vocabulary

19 types aligned with the CMOA Art Tracks standard, organised by whether they transfer ownership (legal title) or only custody (temporary possession).

Ownership transfers

sale: 12,886
collection: 17,883
gift: 10,663
by_descent: 9,344
bequest: 4,378
inheritance: 4,366
widowhood: 3,423
commission: 574
recuperation: 852
restitution: 86
inventory: 75
confiscation: 21
exchange: 4
theft / looting: 5

Custody transfers

loan: 6,333
deposit: 246

Ambiguous / unresolved

transfer (administrative): 6,052
unknown (incl. cross-refs): 21,436

The 21,436 unknown events include 19,871 cross-references ("see provenance of SK-A-XXXX") that link to companion works and are not true unknowns. After excluding cross-references, only 3 events remain genuinely unresolved.
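The ownership/custody split and the cross-reference check can be sketched as a lookup plus one pattern. The keys below reuse the display labels from the lists above (the internal identifiers may differ), and the regular expression for "see provenance of …" is an assumption about how such segments are worded.

```python
import re
from typing import Optional

OWNERSHIP_TYPES = {
    "sale", "collection", "gift", "by_descent", "bequest", "inheritance",
    "widowhood", "commission", "recuperation", "restitution", "inventory",
    "confiscation", "exchange", "theft / looting",
}
CUSTODY_TYPES = {"loan", "deposit"}

# Matches e.g. "see provenance of SK-A-2344" and captures the object number.
CROSS_REF_RE = re.compile(
    r"\bsee (?:the )?provenance of\s+(?P<object>[A-Z]{2,}(?:-[A-Z0-9]+)+)",
    re.IGNORECASE,
)


def transfer_category(transfer_type: str) -> Optional[str]:
    """Map a transfer type to 'ownership' or 'custody'; None when ambiguous."""
    if transfer_type in OWNERSHIP_TYPES:
        return "ownership"
    if transfer_type in CUSTODY_TYPES:
        return "custody"
    return None  # administrative transfers and unknowns stay uncategorised


def cross_reference_target(segment: str) -> Optional[str]:
    """Return the referenced object number if the segment is a cross-reference."""
    match = CROSS_REF_RE.search(segment)
    return match["object"] if match else None
```

Separating cross-references from the rest of the unknown bucket in this way is what allows them to be counted as links to companion works rather than as parsing failures.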

Where the parser needed help from an LLM

The deterministic parser (Parsing Expression Grammar + regex fallback + rule-based inference) resolved ~97% of events. For the remaining cases, an LLM (Claude Sonnet) was used in targeted batch passes to resolve three categories of ambiguity that require world knowledge or contextual reasoning.

Three problems that required an LLM

1. Transfer type classification

170 events used institution-specific jargon, archaic terminology, or non-standard phrasing that no keyword rule could reliably match. The LLM classified each with a transfer type and written reasoning. Cost: $0.73.

"placed at the disposal of the SNK"
→ LLM: deposit
Reasoning: "SNK administered wartime deposits; this is custody, not ownership"

2. Party position disambiguation

824 parties had no deterministic role mapping. The LLM inferred sender/receiver/agent from the narrative context and the transfer type of the event. Cost: $9.03.

"with dealer [Name], to the museum"
→ LLM: [Name] = sender
Reasoning: "dealer is the intermediary selling to the museum (receiver)"

3. Merged party decomposition

367 party texts contained multiple people merged into a single string by the parser. The LLM decomposed these into individual parties with correct positions. Cost: $2.31.

"Jan de Vos, his son Pieter de Vos"
→ LLM: split into 2 parties: Jan de Vos (sender), Pieter de Vos (receiver)
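For a sense of scale, the three reported costs can be totalled and expressed per item. The figures are the ones quoted above; the variable names are illustrative.

```python
# (items processed, reported cost in USD) for each LLM pass described above.
LLM_PASSES = {
    "transfer type classification": (170, 0.73),
    "party position disambiguation": (824, 9.03),
    "merged party decomposition": (367, 2.31),
}

# Combined cost of the three passes.
total_cost = sum(cost for _, cost in LLM_PASSES.values())

# Average cost per event or party in each pass.
cost_per_item = {name: cost / n for name, (n, cost) in LLM_PASSES.items()}

for name, value in cost_per_item.items():
    print(f"{name}: ${value:.4f} per item")
print(f"total: ${total_cost:.2f}")
```

The total comes to $12.07, and each pass averages a little over one US cent per item or less.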

From 3,192 unknowns to 3

Grammar improvements, structural signal detection, deterministic rules, and targeted LLM passes together reduced the non-cross-reference unknowns from 3,192 to 3, a reduction of 99.9%.

Starting unknowns: 3,192
Grammar improvements (PEG rules): −1,890
Structural signals (regex): −832
Unsold reclassification (rule): −230
Credit-line events (rule): −70
Transfer type classification (LLM): −170
Remaining: 3