Decipherment of Substitution Ciphers with Neural Language Models

Kambhatla, Nishant; Bigvand, Anahita Mansouri; Sarkar, Anoop

doi:10.18653/v1/d18-1102

Cited by 16 publications

(20 citation statements)

References 10 publications

“…These adjustments are informed by assumptions about ciphers used to produce the data (Knight and Yamada, 1999; Knight et al, 2006; Ravi and Knight, 2011; Pourdamghani and Knight, 2017). Besides the commonly used EM algorithm, (Nuhn et al, 2013; Hauer et al, 2014; Kambhatla et al, 2018) also tackles substitution decipherment and formulate this problem as a heuristic search procedure, with guidance provided by an external language model (LM) for candidate rescoring. So far, techniques developed for man-made ciphers have not been shown successful in deciphering archaeological data.…”

Section: Related Work (mentioning)

confidence: 99%

Neural Decipherment via Minimum-Cost Flow: From Ugaritic to Linear B

Luo, Cao, Barzilay, 2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper we propose a novel neural approach for automatic decipherment of lost languages. To compensate for the lack of strong supervision signal, our model design is informed by patterns in language change documented in historical linguistics. The model utilizes an expressive sequence-to-sequence model to capture character-level correspondences between cognates. To effectively train the model in an unsupervised manner, we innovate the training procedure by formalizing it as a minimum-cost flow problem. When applied to the decipherment of Ugaritic, we achieve a 5.5% absolute improvement over state-of-the-art results. We also report the first automatic results in deciphering Linear B, a syllabic language related to ancient Greek, where our model correctly translates 67.3% of cognates.

“…They usually use EM algorithms, which are tailored towards these specific types of ciphers, most prominently substitution ciphers (Knight and Yamada, 1999; Knight et al, 2006). Later work by Nuhn et al (2013), Hauer et al (2014), and Kambhatla et al (2018) addresses the problem using a heuristic search procedure, guided by a pretrained language model. To the best of our knowledge, these methods developed for tackling man-made ciphers have so far not been successfully applied to archaeological data.…”

Section: Related Work (mentioning)

confidence: 99%

Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Luo, Hartmann, Santus, et al. 2021

Transactions of the Association for Computational Linguistics

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. We capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure for language closeness which correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, concurring with the favored position by the current scholarship.

“…Our work exploits the embedding space learned by a neural language model, but the actual task of language modeling is otherwise irrelevant to our results. By contrast, Kambhatla et al (2018) actually sample text from a neural language model to help estimate the quality of a proposed decipherment. Future work could similarly sample from a language model as a means of counteracting the small size of the PE corpus; this should be done with caution, however, given the difficulty of evaluating whether the sampled text is fluent.…”

Section: Below) (mentioning)

confidence: 99%

Compositionality of Complex Graphemes in the Undeciphered Proto-Elamite Script using Image and Text Embedding Models

Born, Kelley, Monroe, et al. 2021

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Self Cite

We introduce a language modeling architecture which operates over sequences of images, or over multimodal sequences of images with associated labels. We use this architecture alongside other embedding models to investigate a category of signs called complex graphemes (CGs) in the undeciphered proto-Elamite script. We argue that CGs have meanings which are at least partly compositional, and we discover novel rules governing the construction of CGs. We find that a language model over sign images produces more interpretable results than a model over text or over sign images and text, which suggests that the names given to signs may be obscuring signals in the corpus. Our results reveal previously unknown regularities in proto-Elamite sign use that can inform future decipherment efforts, and our image-aware language model provides a novel way to abstract away from biases introduced by human annotators.
