From raw sources to an interactive app of historical memory and imagined geographies.
At a Glance
This pipeline takes scattered historical materials (scans, PDFs, gazetteers, court records, tribal dictionaries) and turns them into structured, time-aware geographic data. Each phase below corresponds to a real layer in the stack: inputs, extraction, structuring and validation, database schemas, orchestration, and the interactive app.
Note:
Multiple historical sources feed into cross-linked schemas. These schemas let the app select map features by time period and support interactive exploration of contested geographic imaginaries.
What We Ingest
Gazetteers — Lists of historical place names with coordinates, administrative hierarchies, and transliteration variants.
Sources & Surnames — Structured exports distilled from historical documents (bibliography, tribal data, place references) with stable IDs.
Boundaries (Shapefiles/GeoJSON) — Boundary files, in shapefile or GeoJSON form, for provinces, districts, and other units across different time periods.
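To make the gazetteer input concrete, here is a minimal sketch of one entry with a stable ID, coordinates, a parent unit, and transliteration variants, plus a lookup that matches variants regardless of case or diacritics. The names `GazetteerEntry`, `fold`, `build_variant_index`, and `hierarchy`, the ID format, and the sample values are all assumptions made for illustration; they are not the project's actual schema or data.

```python
import unicodedata
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class GazetteerEntry:
    place_id: str                    # stable ID (hypothetical format)
    name: str                        # preferred display name
    lat: float
    lon: float
    parent_id: Optional[str]         # next level up in the administrative hierarchy
    variants: Tuple[str, ...] = ()   # transliteration variants


def fold(name: str) -> str:
    """Strip combining marks and case so that 'Ḥalab' and 'HALAB' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def build_variant_index(entries: List[GazetteerEntry]) -> Dict[str, List[str]]:
    """Map every folded name or variant to the IDs of the places that use it."""
    index: Dict[str, List[str]] = {}
    for entry in entries:
        for label in (entry.name, *entry.variants):
            ids = index.setdefault(fold(label), [])
            if entry.place_id not in ids:
                ids.append(entry.place_id)
    return index


def hierarchy(place_id: str, by_id: Dict[str, GazetteerEntry]) -> List[str]:
    """Walk parent links upward, returning names from the place to the top unit."""
    chain: List[str] = []
    seen = set()  # guards against accidental cycles in parent links
    current = by_id.get(place_id)
    while current is not None and current.place_id not in seen:
        seen.add(current.place_id)
        chain.append(current.name)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return chain


# Illustrative values only, not project data.
ENTRIES = [
    GazetteerEntry("prov-001", "Example Province", 36.0, 37.0, None),
    GazetteerEntry("place-001", "Aleppo", 36.2, 37.16, "prov-001",
                   ("Ḥalab", "Halab", "Alep")),
]
```

Because the index maps a folded label to a list of IDs, two different places that share a variant spelling both remain reachable, and the ambiguity can be resolved later rather than silently overwritten.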
Step by Step
Image — Scans or photographs of manuscripts, books, registers, or maps.
OCR/HTR — Software turns images into raw text (OCR for print; HTR for handwriting), with LLM vision assisting on complex layouts.
Cleanup — We remove scanning noise, fix line breaks, normalize characters (including Arabic script), and keep original spellings where relevant.
Structuring — The cleaned text is organized into structured arrays of objects (places, people, citations, dates, relations) using NER and LLM-based extraction.
ETL & Validation — Structured objects are mapped into the core schemas (SOURCE, SURNAME, LOCATIONS, BOUNDARIES), checked for consistency, enriched with coordinates and temporal tags, and flagged if they need human review.
SQL Database — Validated records are loaded into a secure, spatially enabled, read-only database with cross-linked tables and indexes for search and mapping.
App API — A safe, read-only gateway exposes curated views and domain objects (e.g., "surname with journeys", "place with sources") to the website.
App — The site displays maps, timelines, journeys, and search results built from those curated datasets.
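The ETL & Validation step above can be sketched as a function that maps one extracted object into a LOCATIONS-style record, runs consistency checks, and flags the record for human review instead of rejecting it. This is a sketch under stated assumptions: the field names, the `LocationRecord` class, and the specific checks are hypothetical, since the actual schema columns and validation rules are not given here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class LocationRecord:
    # Hypothetical columns for a LOCATIONS-style table.
    location_id: str
    name: str
    lat: Optional[float]
    lon: Optional[float]
    year_from: Optional[int]
    year_to: Optional[int]
    source_id: Optional[str]
    needs_review: bool = False
    issues: List[str] = field(default_factory=list)


def to_location_record(raw: Dict[str, Any]) -> LocationRecord:
    """Map a raw extracted object to a record and collect any problems found."""
    rec = LocationRecord(
        location_id=str(raw.get("id", "")),
        name=str(raw.get("name", "")).strip(),
        lat=raw.get("lat"),
        lon=raw.get("lon"),
        year_from=raw.get("year_from"),
        year_to=raw.get("year_to"),
        source_id=raw.get("source_id"),
    )
    if not rec.location_id:
        rec.issues.append("missing id")
    if not rec.name:
        rec.issues.append("missing name")
    if rec.lat is None or rec.lon is None:
        rec.issues.append("missing coordinates")
    elif not (-90 <= rec.lat <= 90 and -180 <= rec.lon <= 180):
        rec.issues.append("coordinates out of range")
    if (rec.year_from is not None and rec.year_to is not None
            and rec.year_from > rec.year_to):
        rec.issues.append("date range reversed")
    if rec.source_id is None:
        rec.issues.append("no source citation")
    # Problem records are kept and flagged, so a person can decide what to do.
    rec.needs_review = bool(rec.issues)
    return rec


# Illustrative inputs only.
clean = to_location_record({
    "id": "loc-1", "name": " Example Town ", "lat": 35.5, "lon": 38.0,
    "year_from": 1890, "year_to": 1918, "source_id": "src-7",
})
doubtful = to_location_record({
    "id": "loc-2", "name": "Unlocated Village", "lat": 95.0, "lon": 38.0,
    "year_from": 1918, "year_to": 1890,
})
```

Here `clean` passes every check, while `doubtful` is still loaded as a record but carries three issues (out-of-range coordinates, a reversed date range, and a missing citation) and `needs_review` set to true.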
Time-Aware Boundaries
Administrative borders shift over time. To avoid anachronism, our boundary files are temporally contingent:
Periodized layers — Boundaries are tagged with a year or date range (e.g., 1890–1918) and selected based on the time in view.
Multiple sources — We compare historical atlases, official gazetteers, and archival maps; disagreements are flagged for review.
Best-available fit — When exact dates are uncertain, we choose the most defensible time slice and mark it as such.
User experience — When you move through time on the map, boundaries and labels update to match the chosen period.
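The periodized selection described above can be sketched in a few lines. This example assumes each boundary layer carries an inclusive start and end year and a flag for uncertain dating; `BoundaryLayer`, `layers_for_year`, `label`, and the sample layers are hypothetical names and values, not the project's actual boundary data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoundaryLayer:
    unit_id: str
    name: str
    start_year: int
    end_year: int                # inclusive
    dates_certain: bool = True   # False marks a best-available time slice


def layers_for_year(layers: List[BoundaryLayer], year: int) -> List[BoundaryLayer]:
    """Return the layers whose date range covers the year in view."""
    return [layer for layer in layers if layer.start_year <= year <= layer.end_year]


def label(layer: BoundaryLayer) -> str:
    """Build a map label that discloses uncertain dating."""
    suffix = "" if layer.dates_certain else " (approximate dates)"
    return f"{layer.name}, {layer.start_year}-{layer.end_year}{suffix}"


# Illustrative layers only.
LAYERS = [
    BoundaryLayer("dist-a", "District A (early extent)", 1890, 1918),
    BoundaryLayer("dist-a", "District A (later extent)", 1919, 1939,
                  dates_certain=False),
]

for layer in layers_for_year(LAYERS, 1925):
    print(label(layer))
```

Both ends of the range are treated as inclusive, so adjacent periods for the same unit should not share a year (1890 to 1918, then 1919 onward); otherwise two versions of the same unit would be drawn for that year.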
Quality & Trust
Provenance preserved — Records keep citations, notes, and (when relevant) confidence scores.
Coordinate checks — If a place cannot be precisely located, the system records the issue for human review.
Read-only by design — Public pages cannot alter the database; changes happen through controlled updates.
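The read-only guarantee can be illustrated with a small, self-contained sketch. The database engine is not named here, so this example uses SQLite from Python's standard library purely as a stand-in: a writable connection plays the role of the controlled update, and a connection opened with `mode=ro` plays the role of the public side. The table layout, function names, and sample row are assumptions for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def load_validated_records(db_path: Path) -> None:
    """Controlled update path: the only place a writable connection is used."""
    conn = sqlite3.connect(str(db_path))
    try:
        conn.execute(
            "CREATE TABLE locations ("
            "id TEXT PRIMARY KEY, name TEXT, citation TEXT, confidence REAL)"
        )
        conn.execute(
            "INSERT INTO locations VALUES (?, ?, ?, ?)",
            ("loc-1", "Example Place", "Hypothetical Gazetteer, p. 12", 0.8),
        )
        conn.commit()
    finally:
        conn.close()


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Public path: SQLite rejects writes on a connection opened with mode=ro."""
    uri = db_path.resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def public_write_succeeds(conn: sqlite3.Connection) -> bool:
    """Attempt a write through the public connection and report the outcome."""
    try:
        conn.execute("UPDATE locations SET name = 'tampered'")
        return True
    except sqlite3.OperationalError:
        return False


def demo() -> bool:
    """Load one record, then try to alter it through the read-only connection."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "demo.db"
        load_validated_records(db_path)
        conn = open_read_only(db_path)
        try:
            # Reads still work, and the citation and confidence travel with the row.
            conn.execute("SELECT name, citation, confidence FROM locations").fetchone()
            return public_write_succeeds(conn)
        finally:
            conn.close()
```

Enforcing read-only access at the connection level means the guarantee does not depend on every query in the application being written carefully; a production deployment would more likely achieve the same effect with database roles and permissions.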