How AI Is Decoding Ancient Languages Lost for Millennia

AI models are cracking scripts that stumped scholars for centuries — from Linear A to the Indus Valley. Here is how transformer models and multimodal AI are rewriting linguistic history.

Daniel Evershaw (ML Engineer & Technical Writer) | June 18, 2026 | 6 min read

Last updated: June 18, 2026

Quick Answer

AI uses cross-lingual transfer learning and pattern matching across thousands of known languages to decode ancient scripts. Models like DeepMind's Ithaca now match expert accuracy on Greek inscription dating, and researchers are applying the same methods to undeciphered scripts like Linear A.

In the summer of 2023, a team at the University of Chicago fed 500-year-old Mayan glyphs into a transformer model trained on modern phonetic data. Within weeks, the model identified 23 previously misread signs — more progress than the field had made in the prior decade. AI is not just accelerating ancient language research; it is fundamentally rewriting what is possible.

  • AI models trained on related, known languages can test decipherment hypotheses far faster than any human team: Linear B took about 50 years to decipher by hand, and automated systems have since recovered much of the same mapping in controlled tests without human guidance
  • Undeciphered scripts like Proto-Elamite and Rongorongo may finally yield to neural network pattern matching where human intuition has failed
  • Roughly 1,500 languages and scripts across history remain fully undeciphered, representing an enormous unsolved problem for AI
  • Multimodal AI that reads both the visual form of a glyph and its phonetic context simultaneously outperforms text-only approaches by 40%
  • Decipherment AI risks embedding modern biases into ancient texts — the quality of training data is the single biggest constraint
  • Open datasets like the Cuneiform Digital Library Initiative are becoming the ImageNet of linguistic archaeology

How Does AI Actually Decode a Dead Language?

The core challenge of decipherment is that you must simultaneously learn the grammar, vocabulary, and sound system of a language with no living speakers and often no bilingual key. For centuries, breakthroughs happened only when scholars found documents like the Rosetta Stone, which provided the same text in known and unknown scripts.

AI changes the approach entirely. Modern models use a technique called cross-lingual transfer learning: rather than starting from zero, they leverage statistical patterns from hundreds of living languages to build a prior over what human languages tend to look like. When a model encounters Linear Elamite, for example, it is not guessing blindly — it is matching patterns against thousands of known phonological and morphological structures.
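To make the idea of a prior concrete, here is a minimal sketch in Python. It pools character-bigram counts from a few known-language word samples and uses them to score candidate readings of an unknown word. The word lists, the function names (`build_prior`, `log_score`), and the add-one smoothing are all assumptions chosen for illustration; real systems use far larger corpora and neural models rather than bigram counts.

```python
import math
from collections import Counter

# Tiny, made-up word samples standing in for "hundreds of living languages".
KNOWN_SAMPLES = {
    "lang_a": ["kata", "mira", "tonu", "sela", "paki"],
    "lang_b": ["anori", "talu", "kemi", "rosa", "nuta"],
}

def build_prior(samples):
    """Pool character-bigram counts across all sample languages."""
    bigrams, contexts = Counter(), Counter()
    alphabet = set()
    for words in samples.values():
        for word in words:
            padded = "^" + word + "$"  # mark word start and end
            alphabet.update(padded)
            for a, b in zip(padded, padded[1:]):
                bigrams[(a, b)] += 1
                contexts[a] += 1
    return bigrams, contexts, len(alphabet)

def log_score(word, prior):
    """Add-one smoothed log-probability of a candidate reading under the prior."""
    bigrams, contexts, vocab = prior
    padded = "^" + word + "$"
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab))
        for a, b in zip(padded, padded[1:])
    )

prior = build_prior(KNOWN_SAMPLES)
plausible = log_score("tami", prior)    # alternating consonants and vowels
implausible = log_score("tkmr", prior)  # consonant cluster unseen in the samples
print(plausible > implausible)
```

A candidate reading that resembles the pooled languages scores higher than one that does not, which is the sense in which the model is "not guessing blindly".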

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory developed a model that treats decipherment as a machine translation problem. Given an undeciphered text alongside a cognate language (one believed to be linguistically related), the model searches for the mapping between symbol sets that maximizes the likelihood of a coherent translation. In tests that treated Linear B as if it were still undeciphered, the system correctly translated about 67% of cognates into their Greek equivalents, without any human guidance.
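The mapping search described above can be illustrated with a toy example. This is a sketch under strong simplifying assumptions, not the MIT system: the alphabet has only four signs, the "cognate lexicon" is invented, the likelihood is a crude word-level model, and the search is exhaustive. Real decipherment models need much smarter search because the number of possible mappings grows factorially with the size of the sign inventory.

```python
import math
from itertools import permutations

# Hypothetical cognate lexicon: words attested in the related, known language.
COGNATE_LEXICON = {"tan", "nota", "anto", "ton", "nat"}
KNOWN_LETTERS = "anot"
UNKNOWN_SIGNS = "#@%&"

# Toy "undeciphered" text, generated with the hidden key #->t, @->a, %->n, &->o.
UNKNOWN_TEXT = ["#@%", "%&#@", "@%#&"]

def decode(words, key):
    """Apply a sign-to-letter mapping to every word."""
    return ["".join(key[sign] for sign in word) for word in words]

def log_likelihood(words, p_known=0.2, p_unseen=1e-4):
    """Crude word-level model: lexicon words are likely, all others are not."""
    return sum(
        math.log(p_known if word in COGNATE_LEXICON else p_unseen)
        for word in words
    )

def best_mapping(text):
    """Try every one-to-one mapping and keep the most likely one."""
    best_key, best_score = None, -math.inf
    for letters in permutations(KNOWN_LETTERS):
        key = dict(zip(UNKNOWN_SIGNS, letters))
        score = log_likelihood(decode(text, key))
        if score > best_score:
            best_key, best_score = key, score
    return best_key, best_score

key, score = best_mapping(UNKNOWN_TEXT)
decoded = decode(UNKNOWN_TEXT, key)
print(key, decoded)
```

With only 24 candidate mappings, the hidden key is the single one that turns all three words into lexicon entries, so it wins the search.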

If you want to follow this field closely, the Cuneiform Digital Library Initiative (CDLI) and the Endangered Archives Programme at the British Library both publish open datasets of digitized ancient scripts. These are the training sets researchers use — and they are publicly available for anyone building decipherment models.

Why Have So Many Scripts Resisted Decipherment for Centuries?

The most stubborn undeciphered scripts share a set of frustrating properties. Proto-Elamite, used across ancient Iran from 3200 to 2900 BCE, has over 1,000 known tablets but no bilingual key and no agreed-upon related language. Rongorongo, the only pre-contact writing system from Oceania, survives on just 26 wooden tablets — far too small a corpus for statistical methods to gain purchase.

The difficulty is not just linguistic. Many scripts represent administrative records: lists of goods, ration tables, tax receipts. This creates circularity — you need to understand the economy to decode the text, but you need the text to understand the economy. AI models that incorporate archaeological context (site location, artifact type, dating) alongside the raw symbol sequences outperform pure linguistic models by a significant margin.

| Script | Approx. date | Corpus size | Status | AI approach |
|---|---|---|---|---|
| Linear B | ~1450 BCE | 4,000+ tablets | Deciphered 1952 | Validated training set |
| Linear A | ~1800 BCE | ~1,400 inscriptions | Undeciphered | Active neural research |
| Proto-Elamite | ~3200 BCE | 1,000+ tablets | Undeciphered | Graph-based models |
| Rongorongo | ~1200 CE | 14,000 glyphs | Undeciphered | Limited corpus, blocked |
| Indus Script | ~2600 BCE | 4,000 objects | Undeciphered | Statistical clustering |

What Role Do Multimodal Models Play?

Ancient scripts are not just sequences of characters; they are visual objects embedded in physical artifacts. A Sumerian sign carved into a cylinder seal looks different from the same sign impressed into a clay tablet. Human epigraphers develop an eye for these variations over decades. Multimodal AI can learn them in hours.

Recent work from DeepMind, with academic collaborators including the University of Oxford, produced Ithaca, a neural network trained on roughly 78,000 ancient Greek inscriptions. Ithaca restores missing text in damaged inscriptions, estimates their geographic origin, and dates them to within about 30 years of the ranges assigned by historians. On text restoration, the model alone reached 62% accuracy, well above historians working unaided. Crucially, when historians worked with Ithaca rather than alone, their accuracy at restoring damaged text improved from 25% to 72%.
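The restoration task itself is easy to state in code, even though Ithaca's actual architecture is a neural network and far more capable. The sketch below swaps in a simple bigram count model: it fills a gap with the character that most often appears between the same two neighbours in intact text. The training strings, the `?` gap marker, and the function names are assumptions made for this example.

```python
from collections import Counter

def train_bigrams(texts):
    """Count adjacent character pairs across the intact training texts."""
    counts = Counter()
    for text in texts:
        counts.update(zip(text, text[1:]))
    return counts

def restore(damaged, counts, alphabet, gap="?"):
    """Fill each gap with the character that best fits its two neighbours."""
    chars = list(damaged)
    for i, ch in enumerate(chars):
        if ch != gap:
            continue
        left = chars[i - 1] if i > 0 else None
        right = chars[i + 1] if i + 1 < len(chars) else None

        def fit(candidate):
            # Counter returns 0 for unseen pairs, so unknown contexts add nothing.
            return counts[(left, candidate)] + counts[(candidate, right)]

        chars[i] = max(alphabet, key=fit)
    return "".join(chars)

INTACT = ["athena", "athens", "athenian"]  # toy stand-ins for intact inscriptions
counts = train_bigrams(INTACT)
alphabet = sorted(set("".join(INTACT)))
restored = restore("ath?na", counts, alphabet)
print(restored)
```

A model like this only sees one character on each side of the gap. Systems such as Ithaca condition on the whole inscription, which is what makes long restorations feasible.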

This human-AI collaboration model is now standard in the field. The AI handles the statistical heavy lifting — pattern frequency, co-occurrence matrices, phoneme mapping — while human scholars provide contextual judgment: what was this object used for, where was it found, what do we know about the political context of this period.
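As a rough illustration of that statistical groundwork, the following Python sketch counts sign frequencies and adjacent-sign co-occurrences over a small corpus. The sign IDs (`S01` and so on) and the corpus are invented for the example; a real pipeline would read transliterations from a dataset such as CDLI.

```python
from collections import Counter

# Hypothetical transliterated inscriptions; each string is one sign ID.
CORPUS = [
    ["S01", "S02", "S03"],
    ["S01", "S02", "S04"],
    ["S05", "S02", "S03"],
]

def sign_frequencies(corpus):
    """Count how often each sign occurs anywhere in the corpus."""
    return Counter(sign for inscription in corpus for sign in inscription)

def cooccurrence(corpus):
    """Count how often sign `a` is immediately followed by sign `b`."""
    pairs = Counter()
    for inscription in corpus:
        pairs.update(zip(inscription, inscription[1:]))
    return pairs

freq = sign_frequencies(CORPUS)
pairs = cooccurrence(CORPUS)
print(freq.most_common(1), pairs[("S01", "S02")])
```

Tables like these are the raw material a specialist then interprets: a sign that is very frequent and almost always follows the same neighbour may be a grammatical ending or a unit marker, but deciding which is a matter of context, not counting.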

AI decipherment models can hallucinate confident-sounding but wrong translations, especially when corpus size is small. A model trained on 200 tablets of an unknown script will overfit and produce internally consistent but historically meaningless outputs. Peer review of AI-assisted decipherments requires specialists who understand both the AI methodology and the archaeology — a rare combination that the field is only beginning to develop.

Which Ancient Languages Could AI Crack Next?

The field has a shortlist of scripts where AI progress is most likely in the next five years. Linear A, the undeciphered precursor to Linear B used by the Minoan civilization, is the most-studied candidate. Because Linear B and Linear A share a significant number of signs, models trained on the deciphered script can make probabilistic guesses about the undeciphered one, essentially treating it as a partially corrupted version of a known script.

Researchers pursuing Linear A are using three parallel strategies:

  • Phonetic bootstrapping: Assume Linear A signs that look identical to Linear B signs share the same sound value, then test whether the resulting text is linguistically plausible
  • Semantic clustering: Group signs by their frequency in different document types (palace inventories vs. religious texts) to infer semantic categories before phonetics
  • Comparative mythology: Use AI-processed records of Minoan art and ritual objects to constrain what topics the texts likely discuss, narrowing the hypothesis space
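The first two strategies can be sketched in a few lines of Python. Everything here is a placeholder: the sign IDs, the sound values, and the document labels are invented for illustration and are not a real Linear A or Linear B sign list. The sketch only shows the shape of the computation, namely carrying over values for shared signs and profiling signs by the kind of document they appear in.

```python
from collections import Counter, defaultdict

# Placeholder values for signs assumed to be shared with the deciphered script.
SHARED_VALUES = {"X01": "ka", "X02": "ro", "X03": "ti"}

# Placeholder corpus: (document type, sequence of sign IDs).
DOCS = [
    ("inventory", ["X01", "X09", "X02"]),
    ("inventory", ["X09", "X03"]),
    ("ritual", ["X07", "X01", "X07"]),
]

def transliterate(signs, shared_values):
    """Phonetic bootstrapping: reuse known values, mark the rest as unknown."""
    return [shared_values.get(sign, "?") for sign in signs]

def coverage(docs, shared_values):
    """Fraction of sign tokens that received a borrowed sound value."""
    tokens = [sign for _, signs in docs for sign in signs]
    return sum(sign in shared_values for sign in tokens) / len(tokens)

def dominant_context(docs):
    """Semantic clustering: the document type each sign appears in most often."""
    counts = defaultdict(Counter)
    for doc_type, signs in docs:
        for sign in signs:
            counts[sign][doc_type] += 1
    return {sign: c.most_common(1)[0][0] for sign, c in counts.items()}

reading = transliterate(DOCS[0][1], SHARED_VALUES)
share = coverage(DOCS, SHARED_VALUES)
contexts = dominant_context(DOCS)
print(reading, share, contexts["X09"], contexts["X07"])
```

The plausibility test mentioned in the first strategy would come next: the partial readings are checked against what is known about candidate languages, and a low-coverage or implausible result counts against the borrowed values.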

The Indus Script presents a harder challenge. With no known related language and a corpus almost entirely limited to short seal inscriptions (averaging just five signs), even the best current models cannot determine whether it represents a full writing system or a simpler administrative notation. A 2024 paper from IIT Gandhinagar used network analysis to argue that the script’s internal structure is consistent with a spoken language — which, if correct, makes it a viable target for the next generation of decipherment AI.
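The network analysis from that paper is not reproduced here. A simpler statistic that has been applied to the same question is conditional entropy: how predictable the next sign is given the current one. The Python sketch below estimates it from adjacent sign pairs. The two toy corpora are invented to show the extremes; language-like sequences are generally expected to fall somewhere between fully rigid and fully free ordering.

```python
import math
from collections import Counter

def conditional_entropy(sequences):
    """H(next | current) in bits, estimated from adjacent sign pairs."""
    pair_counts, first_counts = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    total = sum(pair_counts.values())
    entropy = 0.0
    for (a, _b), n in pair_counts.items():
        p_pair = n / total                # joint probability of the pair
        p_next_given_a = n / first_counts[a]
        entropy -= p_pair * math.log2(p_next_given_a)
    return entropy

# Rigid ordering: every sign fully determines the next one.
rigid = [["A", "B", "C"]] * 4
# Free ordering: each sign is followed by either other sign equally often.
free = [["A", "B"], ["A", "C"], ["B", "A"], ["B", "C"], ["C", "A"], ["C", "B"]]

h_rigid = conditional_entropy(rigid)
h_free = conditional_entropy(free)
print(h_rigid, h_free)
```

With inscriptions averaging only five signs, estimates like this are noisy, which is one reason the question remains open.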

For broader context on how AI systems handle ambiguous pattern recognition at scale, see what large language models actually do and multimodal AI that sees, hears, and speaks for the latest on visual-linguistic models that underpin this work.

Frequently Asked Questions

Has AI fully deciphered any ancient language?

AI has not independently deciphered a fully unknown script, but it has made substantial contributions to partially understood ones. The Ithaca model from DeepMind improved historians' accuracy at restoring damaged ancient Greek inscriptions from 25% to 72%, and it also helps with dating and geographic attribution. AI tools have also proposed plausible readings for Linear A signs that human scholars are now testing against the archaeological record.

What is the biggest obstacle to AI deciphering scripts like Rongorongo?

Corpus size is the primary constraint. Rongorongo survives on only 26 wooden tablets with roughly 14,000 glyphs total. Statistical decipherment models need thousands of examples to find reliable patterns. With such a small dataset, models overfit and cannot generalize. Until new Rongorongo texts are discovered, AI progress on that specific script will remain very limited.

Can anyone use AI tools to work on ancient languages?

Yes. Several open-source tools and datasets are publicly available. The Cuneiform Digital Library Initiative provides digitized cuneiform tablets, including Sumerian texts. The British Library's Endangered Archives Programme offers additional primary sources. Python libraries like cltk (Classical Language Toolkit) provide NLP tools specifically built for ancient languages. No institutional affiliation is required to start experimenting.

Does AI decipherment risk introducing errors into the historical record?

This is a real concern that the field is actively debating. AI models can produce fluent-sounding translations that are statistically plausible but historically wrong. The consensus among researchers is that AI outputs should be treated as hypotheses requiring rigorous human verification, not as translations. Journals publishing AI-assisted decipherments now generally require disclosure of methodology and independent expert review.
