Data Pipeline | mapping al-shawwām

At a Glance

Every mark on the map is the visible end of a chain. A page was scanned, transcribed, read for names and places, recorded as its source wrote it, anchored to coordinates and dates, and only then interpreted — and every link in that chain is a named, public, revisable step. Scans, court registers, gazetteers, genealogical compendia, memoirs, maps, and administrative lists enter that chain carrying the mediations that produced them: the registrar's hand, the cartographer's projection, the route a name traveled before it reached a page.

The pipeline keeps those mediations legible rather than resolving them into a single authoritative layer. Attestation and claim, place and coordinate, the geography a source assumes and the geography the interface can show: each pair is held apart, so a reader can always ask how one became the other. What appears on the map are historical claims in circulation, each traceable back through interface, database, and archive to the page that made it.

Two independent streams feed this chain — one from scanned maps, one from source texts. The diagram below shows where they meet.

How Two Streams Meet

Two independent processes feed the same map. A scanned historical map is georeferenced and its boundaries traced; a source text is transcribed and its claims captured verbatim. Neither stream knows about the other until both reach the database; from there, the interface reads them together and holds them answerable to each other.

From a map

Scanned sheet. A historical map enters as a scan, its provenance recorded before anything else.
Georeferencing. Control points anchor the sheet to modern coordinates.
Boundary tracing. An administrative line becomes a polygon that carries its date and the administration that drew it.
Validated hand-off. Nothing crosses into the database without passing a contract both sides test against.
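The validated hand-off can be pictured as a small contract check that runs before a traced boundary is accepted. The sketch below is illustrative only: the field names, the example values, and the specific rules are assumptions, not the project's actual contract. It treats a boundary as a closed ring of longitude/latitude pairs with a source, an administration, and a validity range.

```python
# Field names are illustrative assumptions, not the project's schema.
REQUIRED_FIELDS = ("source_id", "administration", "valid_from", "valid_to", "ring")


def contract_errors(feature: dict) -> list[str]:
    """Return every reason a traced boundary fails the hand-off contract."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in feature]
    if errors:
        return errors  # the remaining checks need all fields present

    ring = feature["ring"]
    if len(ring) < 4:
        errors.append("ring needs at least four positions")
    elif ring[0] != ring[-1]:
        errors.append("ring is not closed")
    for lon, lat in ring:
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            errors.append(f"coordinate out of range: {(lon, lat)}")
    if feature["valid_from"] > feature["valid_to"]:
        errors.append("valid_from is later than valid_to")
    return errors


# Placeholder values, chosen only to exercise the checks.
traced = {
    "source_id": "map-sheet-001",
    "administration": "example administration",
    "valid_from": 1900,
    "valid_to": 1910,
    "ring": [(35.0, 32.0), (35.5, 32.0), (35.5, 32.5), (35.0, 32.0)],
}
open_ring = {**traced, "ring": [(35.0, 32.0), (35.5, 32.0), (35.5, 32.5), (35.0, 32.5)]}

print(contract_errors(traced))
print(contract_errors(open_ring))
```

Returning a list of reasons, rather than raising on the first failure, keeps a rejected boundary inspectable: the tracer sees everything that needs fixing in one pass.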

Curated reference data

Gazetteer. Places held apart from any single source’s claims about them.
Bibliography. Every source citable on its own terms.
Polities. Ottoman, Mandate, and successor administrations, linked by documented succession.

From a page

Source page. A dictionary or genealogy entry enters as a scan and a transcription.
Verbatim capture. What the page asserts is recorded as written, checked across independent readings, and never paraphrased.
Entity, place, journey. The record becomes an entity; its places resolve; its movement becomes a journey the map can draw.
Interpretation, last. Modern readings are derived late, by documented rules, and can always be traced back and undone.

One database

Both streams, and the reference data they draw on, land in the same tables — each row still carrying its source.

Move the timeline, and both layers answer.

One control filters which boundary polygons render and which attestations show.
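As a rough illustration of one control driving both layers, the sketch below filters boundaries by validity range and attestations by date against the same selected year. The class names, fields, the `window` parameter, and the sample rows are hypothetical; the source does not describe how the filter is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    label: str
    valid_from: int  # first year the source's line applies
    valid_to: int    # last year it applies


@dataclass(frozen=True)
class Attestation:
    surname: str
    place_id: str
    year: int
    source_id: str


def visible_at(year, boundaries, attestations, window=0):
    """Return boundaries in force in `year` and attestations dated within `window` years of it."""
    shown_boundaries = [b for b in boundaries if b.valid_from <= year <= b.valid_to]
    shown_attestations = [a for a in attestations if abs(a.year - year) <= window]
    return shown_boundaries, shown_attestations


# Placeholder rows, not project data.
boundaries = [
    Boundary("earlier district line", 1900, 1920),
    Boundary("later sub-district line", 1915, 1940),
]
attestations = [
    Attestation("surname-a", "place-1", 1918, "source-1"),
    Attestation("surname-b", "place-2", 1935, "source-2"),
]

shown_b, shown_a = visible_at(1918, boundaries, attestations, window=2)
print([b.label for b in shown_b], [a.surname for a in shown_a])
```

Note that the filter returns every boundary whose range covers the year. Two overlapping lines are both shown rather than one being chosen over the other.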

A marker, a polygon, and a citation move together.

Selecting a place surfaces the boundary drawn in its era and the source that named it.

Every name resolves against the same gazetteer.

Surnames, boundaries, and markers share one curated index of places — so their disagreements stay legible.
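A shared gazetteer lookup might work along these lines. This is a minimal sketch under stated assumptions: the place identifiers and the folding rule (strip diacritics, ignore case) are invented for illustration, and a real index would need to handle Arabic script and far more variation than this does.

```python
import unicodedata


def fold(name: str) -> str:
    """Lowercase and strip combining diacritics so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return kept.casefold().strip()


# Hypothetical identifiers; the variant spellings are common transliterations.
GAZETTEER = {
    "place-damascus": ["Dimashq", "Damascus"],
    "place-aleppo": ["Ḥalab", "Aleppo"],
}


def build_index(gazetteer):
    """Map each folded name to the set of place ids that carry it."""
    index = {}
    for place_id, names in gazetteer.items():
        for name in names:
            index.setdefault(fold(name), set()).add(place_id)
    return index


def resolve(name, index):
    """Return candidate place ids: empty means unresolved, more than one means ambiguous."""
    return sorted(index.get(fold(name), set()))


index = build_index(GAZETTEER)
print(resolve("halab", index))
print(resolve("Unlisted Village", index))
```

Returning a list of candidates instead of a single identifier keeps ambiguity and absence visible to whatever calls the lookup, so a disagreement is recorded rather than resolved by accident.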

The complete seven-phase architecture behind both streams is laid out in the full methodology below, stage by stage.

What We Ingest

The pipeline does not treat its inputs as a uniform pool of data. It distinguishes between three kinds of material, because each kind enters the chain with different commitments attached.

Documentary traces. Court registers, gazetteers, census returns, genealogical compendia, memoirs, travel accounts, administrative correspondence, and historical maps. These arrive with their own conventions of authority — what counts as a place, what counts as a person, what counts as a boundary — and the pipeline preserves those conventions rather than translating them into a single house style.

Named entities and relations. People, surnames, lineages, places, administrative units, dates, journeys, and the citations that bind them. These are not extracted as isolated facts. They are extracted as claims about who was where, when, and on whose attestation.

Spatial frameworks. Coordinates, place hierarchies, polity systems, boundary files, and temporal validity ranges. These are the scaffolds against which the other two kinds of material become legible on a map. They are also the scaffolds most likely to impose anachronism if used carelessly, and they are versioned and dated for that reason.

Step by Step

Trace — A scan, page, map, table, register, or bibliographic entry enters the system with source metadata attached. The first judgment is not what the material “means,” but what kind of source-form it is and where its authority comes from.
Transcription — OCR, HTR, or LLM-assisted vision converts images into text. Noise, layout, damaged text, marginalia, and uncertain readings are not treated as invisible; they remain part of the record’s condition.
Normalization — Names, dates, scripts, and transliterations are cleaned enough to be searched and compared. Original forms remain attached where they carry historical, linguistic, or evidentiary significance.
Extraction — NER and LLM-assisted extraction identify people, surnames, places, dates, affiliations, routes, citations, and administrative terms. These are extracted as claims made by a source, not as facts detached from it.
Structuring — Extracted claims become schema-shaped records with stable identifiers. The schema keeps a person, the names attached to that person, a place, and the coordinates attached to that place as separate objects, so the relationships between them remain inspectable rather than fused.
Validation — Records are checked against required fields, source links, coordinate plausibility, hierarchy alignment, and temporal ranges. Conflicts and gaps are flagged rather than silently repaired.
Enrichment — Validated records are linked outward to coordinates, place hierarchies, polity systems, boundary layers, temporal metadata, and confidence notes. Enrichment makes comparison possible, but it does not erase the path by which the claim arrived.
Publication — Curated, read-only views expose the database to the site through the API. The interface receives claims with their provenance and uncertainty still attached.
Interface — Maps, timelines, panels, labels, and journeys make the claims visible. The goal is not to settle the geography, but to let users follow how places, names, and routes circulate across sources.
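The Structuring and Validation steps above can be sketched as separate record types plus a check that reports problems instead of fixing them. Everything named here (classes, fields, flag wording, sample values) is an assumption made for illustration, not the project's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    place_id: str
    label: str


@dataclass(frozen=True)
class Coordinate:
    # Kept apart from Place so a place can carry several sourced coordinates, or none.
    place_id: str
    lat: float
    lon: float
    source_id: str


@dataclass(frozen=True)
class Claim:
    claim_id: str
    surname_as_written: str
    place_id: Optional[str]
    year_from: Optional[int]
    year_to: Optional[int]
    source_id: Optional[str]


def validation_flags(claim, coordinates):
    """Return flags for a claim; nothing is repaired or dropped here."""
    flags = []
    if not claim.source_id:
        flags.append("no source link")
    matching = [c for c in coordinates if c.place_id == claim.place_id]
    if claim.place_id is None:
        flags.append("place unresolved")
    elif not matching:
        flags.append("place has no coordinate")
    for c in matching:
        if not (-90 <= c.lat <= 90 and -180 <= c.lon <= 180):
            flags.append("coordinate implausible")
    if (claim.year_from is not None and claim.year_to is not None
            and claim.year_from > claim.year_to):
        flags.append("temporal range inverted")
    return flags


# Placeholder records, not project data.
place = Place("place-1", "example place")
coords = [Coordinate("place-1", 33.5, 36.3, "gazetteer-1")]
ok = Claim("claim-1", "surname-a", "place-1", 1900, 1910, "source-1")
bad = Claim("claim-2", "surname-b", None, 1910, 1900, None)

print(validation_flags(ok, coords))
print(validation_flags(bad, coords))
```

The flagged claim is still returned to the caller intact, which matches the stated principle that conflicts and gaps are flagged rather than silently repaired.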

Full Methodology

What follows is the complete seven-phase architecture behind both streams. Each phase is listed with the components that carry its conventions and judgments.

Documentary Traces

Historical Documents, Genealogical Compendia, Gazetteers, Administrative Records

Recognition & Structuring

Images, Raw Text, Structured Text

Validation & Anchoring

Schema Mapping, Validation, Enrichment & Geocoding

Schemas & Storage

SOURCE, SURNAME, LOCATIONS, BOUNDARIES

Feature Assemblages

Map Features, Timeline-Aware Layers, Crosslinked Relations, API & Domain Objects

Interactive Interfaces

Timeline Sliders, List / Details / Filters Panels, Drilldown Mode, Journeys & Stories Mode, Contested Geographies

Future Affordances

Auth / Login, Custom Dataset Imports, Attestation Submission, Confidence Score Shaping, Video / Still Export

Note: The diagram is a guide, not a flowchart of facts. What moves between phases are claims about the past, each tied to the source that made them and the conventions that produced it.

Capture and Interpretation Are Kept Apart

The pipeline separates two jobs that most data systems fuse. The tools and people that transcribe a source are held to faithful capture: names as spelled, dates as printed, affiliations only where the text states them, empty fields where the source is silent. Interpretation — reading a historical claim into modern terms — happens later, in a single documented and revisable step, kept apart from capture so that each can be audited on its own. A reader can therefore always distinguish what a source said from what the platform made of it.
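One way to picture the separation is as two record types: an immutable capture, and interpretations that point at it without altering it. This is a hedged sketch; the type names, fields, and the `reader_view` helper are hypothetical and stand in for whatever the platform actually stores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capture:
    """What the source says, recorded as written. None marks silence in the source."""
    capture_id: str
    source_id: str
    name_as_spelled: str
    date_as_printed: Optional[str]
    affiliation: Optional[str]


@dataclass(frozen=True)
class Interpretation:
    """A later reading, tied to the capture it reads and the rule that produced it."""
    capture_id: str
    rule_id: str
    modern_label: str


def reader_view(capture, interpretations):
    """Show what the source said next to what was derived from it, never merged."""
    derived = [i for i in interpretations if i.capture_id == capture.capture_id]
    return {"source_said": capture, "platform_derived": derived}


# Placeholder values, not project data.
cap = Capture("cap-1", "source-1", "name as spelled", None, None)
interp = Interpretation("cap-1", "rule-1", "modern label")

with_reading = reader_view(cap, [interp])
without_reading = reader_view(cap, [])  # undoing the interpretation leaves the capture untouched
print(with_reading["source_said"] == without_reading["source_said"])
```

Because `Capture` is frozen, no later step can overwrite a spelling or fill an empty field in place; any reading has to be added as a separate, removable record.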

Nowhere does this matter more than in the geography of Palestine. When al-ʿAmmārī writes Filasṭīn, he means historic Palestine in its entirety, including what became Israel in 1948; his Palestine-or-Jordan framing is Mandate-era logic, not a modern border. The platform records that usage as he wrote it. Where a modern reading is derived — which present-day polity a historical place falls within — the derivation runs through documented lines of administrative succession, accepts a match only above a set confidence threshold, and remains reversible. The platform does not decide what Palestine is; it shows what each source asserted, and exactly how any modern label was reached.
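The derivation described above (follow documented succession, accept only above a threshold, stay reversible) could be sketched as follows. The succession table, the unit names, the threshold value, and the weakest-link rule for combining confidence are all assumptions for illustration; the source does not say how confidence is computed.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical links: unit -> (successor unit, confidence in that documented link).
SUCCESSION = {
    "unit-a": ("unit-b", 0.95),
    "unit-b": ("unit-c", 0.90),
    "unit-x": ("unit-y", 0.40),
}
THRESHOLD = 0.8  # assumed value


@dataclass(frozen=True)
class Derivation:
    start: str
    path: tuple        # every unit passed through, kept so the reading can be traced back
    confidence: float  # assumption: the weakest link sets the overall confidence
    accepted: bool

    @property
    def modern_unit(self) -> Optional[str]:
        """The derived label, or None when the match fell below the threshold."""
        return self.path[-1] if self.accepted else None


def derive(start, succession=SUCCESSION, threshold=THRESHOLD):
    """Walk the succession chain from `start` and record how the result was reached."""
    path, confidence, current = [start], 1.0, start
    while current in succession:
        successor, link_confidence = succession[current]
        if successor in path:  # guard against a cyclic table
            break
        path.append(successor)
        confidence = min(confidence, link_confidence)
        current = successor
    return Derivation(start, tuple(path), confidence, confidence >= threshold)


print(derive("unit-a"))
print(derive("unit-x").modern_unit)
```

The derivation is a separate object that names its own path, so discarding it removes the modern label without touching the recorded usage it was derived from.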

Time-Aware Boundaries

Boundaries on the platform are not timeless containers. A district line drawn under late Ottoman administration, a sub-district reorganized under Mandate rule, a frontier hardened after 1948, and a locally remembered region that crosses all three are not the same kind of object, and the platform does not pretend they are. Each boundary carries the date of its source, the administrative vocabulary that produced it, and the degree of fit between that vocabulary and the territory it claimed to describe.

Where boundaries overlap, the platform shows the overlap. Where a region in one source has no equivalent in another, the platform does not invent one. Where a boundary’s exact line is uncertain, that uncertainty is preserved rather than smoothed into a confident polygon. The geography of Bilad al-Sham is not the residue of one administrative regime; it is the layered, sometimes contradictory record of many, and the platform’s job is to keep those layers distinguishable.

Accountability, Not Certainty

The platform does not promise that its geographies are correct. It promises that the path by which a place became a coordinate remains attached to the coordinate, and that disagreement between sources is preserved rather than adjudicated. A village named differently in an Ottoman sijill, a Mandate gazetteer, and a post-1948 map is not three errors to reconcile; it is three attestations to hold side by side, each with its date, its source, and its administrative vocabulary intact.

Where a location is uncertain, the record says so. Where coordinates are weak, where a hierarchy is contested, where a name has variants the platform cannot rank, those gaps are marked rather than smoothed. Public access is read-only — enforced in the database itself — not because the archive is finished, but because revision should leave a trace. The pipeline is built on the conviction that a geography held accountable to its sources can be corrected, extended, and argued with — and that is what this one is for.