Provenance Parser

Converts semi-structured, free-text ownership histories into structured data

The Rijksmuseum records the ownership history of ~48,000 artworks as free-text narratives following the AAM punctuation convention. The provenance parser in rijksmuseum-mcp+ parses these narratives into over 100,000 structured events, making them searchable by party, transfer type, date, location, and price. Every extracted event and party carries a three-tier method tag (peg / rule / llm) so its source can be audited.

48,539 artworks with provenance

101,092 events parsed

19 transfer types

14 currencies recognised

7 unresolved events

Processing pipeline

Each provenance narrative passes through a three-layer pipeline. Deterministic parsing handles ~97% of events; an LLM enrichment pass resolves the remainder.

Layer 1a: Preprocessing. Strip HTML, extract {citations}, split on semicolons, rejoin prices split across commas.

Layer 1b: PEG Grammar. 22 ordered-choice rules match event-type-specific patterns (sales, gifts, bequests, loans…).

Layer 1c: Regex Fallback. Keyword-priority classifier for segments the grammar can't match.

Layer 2: Post-processing. Assign party positions (sender / receiver), transfer category (ownership / custody), ownership periods.

Layer 3: LLM Enrichment. Classify residual unknowns, disambiguate merged parties, assign positions, with reasoning.
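The preprocessing stage (Layer 1a) can be pictured roughly as follows. This is a minimal Python sketch, not the project's code: the function name `preprocess` and both regular expressions are assumptions, the price-rejoining step is left out, and citations are returned as a flat list rather than linked to their events.

```python
import re
from html import unescape

# Hypothetical patterns; the real preprocessing rules are not shown in this page.
TAG_RE = re.compile(r"<[^>]+>")          # crude HTML tag stripper
CITATION_RE = re.compile(r"\{([^{}]*)\}")  # text inside {curly braces}


def preprocess(raw: str) -> tuple[list[str], list[str]]:
    """Return (segments, citations) for one provenance narrative."""
    # 1. Strip HTML tags and decode entities.
    text = unescape(TAG_RE.sub("", raw))
    # 2. Pull out citations before splitting.
    citations = [c.strip() for c in CITATION_RE.findall(text)]
    text = CITATION_RE.sub("", text)
    # 3. Split on semicolons and drop empty segments.
    segments = [s.strip() for s in text.split(";")]
    return [s for s in segments if s], citations


if __name__ == "__main__":
    raw = (
        "<p>? her widower, Jacob Dissius (1653-1695), Delft;"
        "{Montias 1989, pp. 246-262, 359, doc. 417.} "
        "his sale, Amsterdam, 16 May 1696, no. 2, fl. 175</p>"
    )
    segments, citations = preprocess(raw)
    for s in segments:
        print(s)
    print(citations)
```

In this sketch the citations are removed before the split so that a semicolon inside the braces cannot produce a spurious segment.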

Parse method distribution (101,092 events)

Parsing Expression Grammar (PEG): 80%
Cross-ref: 20%
LLM: remainder

Example: The Milkmaid (SK-A-2344)

A provenance narrative for one of the Rijksmuseum's most famous paintings, showing the data elements the parser extracts from each semicolon-delimited event.

Field labels used in the annotated text: party, role, type (transfer type), date, loc (location), price, lot (lot / details), cite (citation), gap.

? party Pieter Claesz van Ruijven (1624-1674) and party Maria Simonsdr de Knuijt (1623-1681), loc Delft ;
? role their daughter, party Magdalena van Ruijven (1655-1682), loc Delft ;
? role her widower, party Jacob Dissius (1653-1695), loc Delft ;
cite {Montias 1989, pp. 246-262, 359, doc. 417.}
role his type sale, loc Amsterdam, date 16 May 1696, lot no. 2, price fl. 175, to party Isaac Rooleeuw (1663-1701), loc Amsterdam ;
gap …
party Lucretia Johanna van Winter (1785-1845), loc Amsterdam, date 1817 ;
role from whom type purchased by the party Rijksmuseum, loc Amsterdam, date 1908, price fl. 550,000 with batch 38 other paintings

Abbreviated for display. Full narrative has 16 events spanning 1624–1908. The ? prefix marks uncertain attributions; … marks a gap in the documented chain.

What the parser extracts

Each provenance event is decomposed into structured fields, stored in a SQLite database, and exposed through the search_provenance tool.
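To make the storage model concrete, here is a small sketch of how such events might sit in SQLite and be filtered. The table name, column names, and the `find_events` helper are all hypothetical; they are not the project's schema or the `search_provenance` interface. The sample rows come from the Milkmaid excerpt on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per parsed event.
conn.execute(
    """
    CREATE TABLE provenance_event (
        object_number TEXT,
        seq           INTEGER,
        transfer_type TEXT,
        party         TEXT,
        position      TEXT,
        location      TEXT,
        year          INTEGER,
        price_amount  REAL,
        currency      TEXT,
        parse_method  TEXT
    )
    """
)

rows = [
    ("SK-A-2344", 1, "collection", "Pieter Claesz van Ruijven", "receiver", "Delft", None, None, None, "peg"),
    ("SK-A-2344", 2, "by_descent", "Magdalena van Ruijven", "receiver", "Delft", None, None, None, "peg"),
    ("SK-A-2344", 3, "widowhood", "Jacob Dissius", "receiver", "Delft", None, None, None, "peg"),
    ("SK-A-2344", 4, "sale", "Isaac Rooleeuw", "receiver", "Amsterdam", 1696, 175, "guilders", "peg"),
]
conn.executemany("INSERT INTO provenance_event VALUES (?,?,?,?,?,?,?,?,?,?)", rows)


def find_events(conn, transfer_type=None, location=None):
    """Filter events by optional transfer type and location (illustrative helper)."""
    clauses, params = [], []
    if transfer_type is not None:
        clauses.append("transfer_type = ?")
        params.append(transfer_type)
    if location is not None:
        clauses.append("location = ?")
        params.append(location)
    where = f"WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = (
        "SELECT object_number, seq, party, year "
        f"FROM provenance_event {where} ORDER BY object_number, seq"
    )
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    print(find_events(conn, transfer_type="sale"))
    print(find_events(conn, location="Delft"))
```

Only parameter placeholders carry user input into the query, so the filters can be combined freely without string-building the values themselves.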

Transfer type

How ownership changed: sale, gift, bequest, inheritance, confiscation, loan, and 13 other types aligned with the CMOA vocabulary.
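For segments the grammar cannot match, the pipeline falls back to a keyword-priority classifier. The sketch below shows the general idea in Python under stated assumptions: the keyword list, its ordering, and the `classify` function are invented for illustration and cover only a handful of the 19 types.

```python
import re

# Hypothetical (pattern, transfer type) pairs in priority order: the first match wins.
KEYWORD_RULES = [
    (re.compile(r"\bwidow(er)?\b", re.I), "widowhood"),
    (re.compile(r"\b(donated|gift|presented)\b", re.I), "gift"),
    (re.compile(r"\b(bequeathed|bequest)\b", re.I), "bequest"),
    (re.compile(r"\b(on loan|lent)\b", re.I), "loan"),
    (re.compile(r"\b(sale|sold|purchased)\b", re.I), "sale"),
]


def classify(segment: str) -> str:
    """Return the first transfer type whose keyword pattern matches the segment."""
    for pattern, transfer_type in KEYWORD_RULES:
        if pattern.search(segment):
            return transfer_type
    return "unknown"


if __name__ == "__main__":
    for text in ["his sale, Amsterdam", "donated to the museum", "her widower, Jacob Dissius"]:
        print(text, "->", classify(text))
```

Ordering matters in a design like this: a segment that mentions both a widow and a sale is assigned whichever rule appears first, so more specific keywords would normally be placed ahead of more general ones.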

"his sale, Amsterdam" → sale
"donated to the museum" → gift
"her widower, Jacob…" → widowhood

Parties & roles

People and institutions involved, with life dates, roles (buyer, seller, heir, donor…) and inferred position: sender, receiver, or agent.

"to Isaac Rooleeuw (1663-1701)"
→ name: Isaac Rooleeuw
→ dates: 1663–1701, role: buyer
→ position: receiver
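The extraction above can be approximated with a single regular expression. This is a simplified sketch, not the PEG grammar itself: the `Party` dataclass, the `parse_party` function, and the rule that a leading "to" means receiver and "from" means sender are assumptions made for illustration, and the role (buyer, seller, heir…) is not derived here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern: optional "?" marker, optional "to"/"from", name, optional (birth-death).
PARTY_RE = re.compile(
    r"(?P<uncertain>\?\s*)?"
    r"(?:(?P<prep>to|from)\s+)?"
    r"(?P<name>[^()]+?)\s*"
    r"(?:\((?P<birth>\d{4})[-–](?P<death>\d{4})\))?"
)

POSITION_BY_PREPOSITION = {"to": "receiver", "from": "sender"}  # assumed mapping


@dataclass
class Party:
    name: str
    birth_year: Optional[int]
    death_year: Optional[int]
    position: Optional[str]
    uncertain: bool


def parse_party(text: str) -> Optional[Party]:
    """Parse one party phrase; return None if it does not fit the pattern."""
    m = PARTY_RE.fullmatch(text.strip())
    if m is None:
        return None
    return Party(
        name=m.group("name").strip(),
        birth_year=int(m.group("birth")) if m.group("birth") else None,
        death_year=int(m.group("death")) if m.group("death") else None,
        position=POSITION_BY_PREPOSITION.get(m.group("prep")),
        uncertain=m.group("uncertain") is not None,
    )


if __name__ == "__main__":
    print(parse_party("to Isaac Rooleeuw (1663-1701)"))
```

A phrase without a preposition, such as a bare name with life dates, yields `position=None` in this sketch; the page describes positions for such parties as being assigned later in post-processing or by the LLM pass.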

Dates & qualifiers

Event year with qualifier: before, after, circa (±10 years), or exact. Full date expressions preserved.
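A rough Python sketch of this year-plus-qualifier extraction follows. The `parse_date` function and its patterns are assumptions for illustration; following the examples quoted for this field, an exact date is returned with a `None` qualifier, and the original expression is kept alongside the year.

```python
import re

# Hypothetical qualifier prefixes, checked in order.
QUALIFIER_PATTERNS = [
    (re.compile(r"^(?:c\.|ca\.|circa)\s*", re.I), "circa"),
    (re.compile(r"^before\s+", re.I), "before"),
    (re.compile(r"^after\s+", re.I), "after"),
]
YEAR_RE = re.compile(r"\b(1\d{3}|20\d{2})\b")  # four-digit years 1000-2099


def parse_date(expression: str) -> dict:
    """Extract the event year and qualifier, preserving the full expression."""
    text = expression.strip()
    qualifier = None
    for pattern, name in QUALIFIER_PATTERNS:
        if pattern.match(text):
            qualifier = name
            break
    m = YEAR_RE.search(text)
    year = int(m.group(1)) if m else None
    return {"year": year, "qualifier": qualifier, "expression": text}


if __name__ == "__main__":
    for expr in ["16 May 1696", "c. 1817", "before 1674"]:
        print(parse_date(expr))
```

The ±10-year window for circa dates is a search-time interpretation and is not modelled in this sketch.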

"16 May 1696" → year: 1696, qualifier: null
"c. 1817" → year: 1817, qualifier: circa
"before 1674" → year: 1674, qualifier: before

Locations

Place names matched against the vocabulary database's place terms. 81% of known places are geocoded, enabling proximity search and region-based queries.
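Once places carry coordinates, a proximity search reduces to a distance filter. The sketch below uses the haversine great-circle formula, which is an assumption on my part: the page does not say how distances are computed. The `PLACES` lookup and the `places_within` helper are hypothetical; the coordinates are the two quoted on this page.

```python
from math import asin, cos, radians, sin, sqrt

# Coordinates as quoted on this page (latitude, longitude in degrees).
PLACES = {
    "Amsterdam": (52.37, 4.89),
    "Delft": (52.01, 4.36),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def places_within(center: str, radius_km: float) -> list[str]:
    """Names of other known places within radius_km of the centre place."""
    lat0, lon0 = PLACES[center]
    return sorted(
        name
        for name, (lat, lon) in PLACES.items()
        if name != center and haversine_km(lat0, lon0, lat, lon) <= radius_km
    )


if __name__ == "__main__":
    print(places_within("Delft", 60))
    print(places_within("Delft", 10))
```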

"Amsterdam" → loc: Amsterdam (52.37°N, 4.89°E)
"Delft" → loc: Delft (52.01°N, 4.36°E)

Prices & currencies

Amounts in original historical currency. 14 currencies recognised — dominant ones are guilders, euros, pounds, francs, yen, and deutschmarks; smaller tails for dollars, Swiss francs, guineas, reichsmarks, livres, marks, napoléons, and Belgian francs. A batch_price flag marks en bloc totals for multiple works.

"fl. 175" → 175 guilders
"fl. 550,000 with 38 other paintings"
→ 550,000 guilders, batch_price: true
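The two price examples can be reproduced with a short sketch like the one below. Only the "fl." prefix for guilders is taken from this page; the `parse_price` function, the "£" entry, and the phrases used to detect a batch price are assumptions added for illustration.

```python
import re
from typing import Optional

# "fl." -> guilders is from the examples above; "£" is an assumed extra entry.
CURRENCY_PREFIXES = {"fl.": "guilders", "£": "pounds"}

PRICE_RE = re.compile(r"(?P<cur>fl\.|£)\s*(?P<amount>\d+(?:,\d{3})*)")
# Assumed cues that the amount covers several works.
BATCH_RE = re.compile(r"\bwith\s+\d+\s+other\b|\ben bloc\b", re.I)


def parse_price(text: str) -> Optional[dict]:
    """Return amount, currency and batch flag, or None if no price is found."""
    m = PRICE_RE.search(text)
    if m is None:
        return None
    return {
        "amount": int(m.group("amount").replace(",", "")),  # drop thousands separators
        "currency": CURRENCY_PREFIXES[m.group("cur")],
        "batch_price": bool(BATCH_RE.search(text)),
    }


if __name__ == "__main__":
    print(parse_price("fl. 175"))
    print(parse_price("fl. 550,000 with 38 other paintings"))
```

Keeping the amount in its original currency, as the page describes, avoids committing to any historical exchange rate at parse time.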

Sale details & citations

Lot numbers, auctioneer names, and bibliographic references extracted from {curly braces} and stored separately.

"no. 2" → lot number: 2
"{Montias 1989, pp. 246-262}"
→ citation (linked to event)

Provenance signals

Boolean flags and metadata that support filtering and research workflows:

uncertain — ? prefix on event or party

gap — … break in ownership chain

unsold — auction lot was bought-in (no transfer occurred)

batch_price — price covers multiple works

parseMethod — peg, cross_ref, llm_structural, or credit_line

categoryMethod / positionMethod — provenance of every enrichment decision
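The text-level flags in this list could be detected along the following lines. This is a sketch only: the `detect_signals` function is hypothetical, and the wording it looks for to spot an unsold lot ("bought in", "unsold") is an assumption rather than the parser's actual rule.

```python
import re

# Assumed surface cues for a bought-in lot.
UNSOLD_RE = re.compile(r"\bbought[- ]in\b|\bunsold\b", re.I)


def detect_signals(segment: str) -> dict:
    """Derive boolean provenance signals from one semicolon-delimited segment."""
    text = segment.strip()
    return {
        "uncertain": text.startswith("?"),          # ? prefix on the event
        "gap": text.startswith(("…", "...")),       # … break in the chain
        "unsold": bool(UNSOLD_RE.search(text)),      # lot did not change hands
    }


if __name__ == "__main__":
    print(detect_signals("? her widower, Jacob Dissius (1653-1695), Delft"))
    print(detect_signals("…"))
    print(detect_signals("his sale, Amsterdam, 16 May 1696, no. 2, bought in"))
```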

Parsed output for the Milkmaid (excerpt)

The first five events as structured rows. Each row is one semicolon-delimited segment from the raw text. Where a segment names two co-owners, each followed by dates in parentheses (e.g. “X (dates) and Y (dates)”), the grammar currently captures only the first party. This is a known limitation.

#	Type	Party	Position	Location	Date	Price	Method
1	collection	Pieter Claesz van Ruijven (1624-1674)	receiver	Delft	—	—	peg
2	by_descent	Magdalena van Ruijven (1655-1682)	receiver	Delft	—	—	peg
3	widowhood	Jacob Dissius (1653-1695)	receiver	Delft	—	—	peg
4	sale	Isaac Rooleeuw (1663-1701)	receiver	Amsterdam	16 May 1696	fl. 175	peg
…	gap in the documented ownership chain
5	collection	Lucretia Johanna van Winter (1785-1845)	receiver	Amsterdam	1817	—	peg

Transfer type vocabulary

19 types aligned with the CMOA Art Tracks standard, organised by whether they transfer ownership (legal title) or only custody (temporary possession).

Ownership transfers

Type	Events
sale	15,657
collection	18,561
gift	10,897
by_descent	13,663
bequest	4,378
widowhood	3,455
commission	580
recuperation	853
restitution	86
inventory	122
inheritance	41
confiscation	19
looting	5
exchange	5
theft	4

Custody transfers

Type	Events
loan	6,301
deposit	244

Other / unresolved

Type	Events
transfer (administrative)	6,275
non_provenance (Step 7 reclassification)	67
unknown (incl. cross-refs)	19,879

The 19,879 unknown events include 19,872 cross-references ("see provenance of SK-A-XXXX") that link to companion works and are not true unknowns. After excluding cross-references, only 7 events remain genuinely unresolved.
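The counts above can be checked with a few lines of Python. The figures are copied from this page; the dictionary layout and the "other" grouping key are just one convenient way to hold them, not the project's data model.

```python
# Event counts per transfer type, grouped as on this page.
COUNTS = {
    "ownership": {
        "sale": 15657, "collection": 18561, "gift": 10897, "by_descent": 13663,
        "bequest": 4378, "widowhood": 3455, "commission": 580, "recuperation": 853,
        "restitution": 86, "inventory": 122, "inheritance": 41, "confiscation": 19,
        "looting": 5, "exchange": 5, "theft": 4,
    },
    "custody": {"loan": 6301, "deposit": 244},
    "other": {"transfer": 6275, "non_provenance": 67, "unknown": 19879},
}
CROSS_REFERENCES = 19872  # "see provenance of ..." events counted under unknown

# Reverse lookup: transfer type -> category.
CATEGORY_OF = {t: cat for cat, types in COUNTS.items() for t in types}

total_events = sum(sum(types.values()) for types in COUNTS.values())
genuinely_unresolved = COUNTS["other"]["unknown"] - CROSS_REFERENCES

if __name__ == "__main__":
    print(total_events)          # 101092
    print(genuinely_unresolved)  # 7
    print(CATEGORY_OF["loan"])   # custody
```

The per-type counts add up to the 101,092 events reported at the top of the page, and subtracting the cross-references from the unknowns leaves the 7 unresolved events.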

Where the parser needed help from an LLM

The deterministic parser (Parsing Expression Grammar + regex fallback + rule-based inference) resolved ~97% of events. For the remaining cases, an LLM (Claude Sonnet) was used in targeted batch passes to resolve three categories of ambiguity that require world knowledge or contextual reasoning.

Three problems that required an LLM

1. Transfer type classification

166 events used institution-specific jargon, archaic terminology, or non-standard phrasing that no keyword rule could reliably match. The LLM classified each with a transfer type and written reasoning.

"placed at the disposal of the SNK"
→ LLM: deposit
Reasoning: "SNK administered wartime
deposits; this is custody, not ownership"

2. Party position disambiguation

1,276 parties had no deterministic role mapping. The LLM inferred sender/receiver/agent from the narrative context and the transfer type of the event.

"with dealer [Name], to the museum"
→ LLM: [Name] = sender
Reasoning: "dealer is the intermediary
selling to the museum (receiver)"

3. Merged party decomposition

439 party records were split out of merged text (e.g. “X, his son Y”). The LLM then assigned a position to each resulting party, inferring each person's role from the narrative.

"Jan de Vos, his son Pieter de Vos"
→ LLM: split into 2 parties
Jan de Vos (sender),
Pieter de Vos (receiver)

From 3,204 unknowns to 7

In the end, a series of grammar improvements, structural signal detection, deterministic rules, and targeted LLM passes reduced the (non-cross-reference) unknowns by 99.8%.

Step	Change	Method
Starting unknowns	3,204	
Grammar improvements	−1,890	PEG rules
Structural signals	−832	regex
Unsold reclassification	−230	rule
Credit-line events	−79	rule
Transfer type classification	−166	LLM
Remaining	7	
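The reduction can be verified step by step. The numbers below are the ones given in this section; the variable names are only for illustration.

```python
START = 3204  # non-cross-reference unknowns at the outset

# (step, events resolved, method) as listed in this section.
STEPS = [
    ("Grammar improvements", 1890, "PEG rules"),
    ("Structural signals", 832, "regex"),
    ("Unsold reclassification", 230, "rule"),
    ("Credit-line events", 79, "rule"),
    ("Transfer type classification", 166, "LLM"),
]

remaining = START - sum(resolved for _, resolved, _ in STEPS)
reduction = (START - remaining) / START

if __name__ == "__main__":
    running = START
    for name, resolved, method in STEPS:
        running -= resolved
        print(f"{name:<30} -{resolved:<6} ({method}) -> {running}")
    print(remaining, f"{reduction:.1%}")  # 7 99.8%
```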