2019 International Conference on Document Analysis and Recognition (ICDAR)
DOI: 10.1109/icdar.2019.00022

Decipherment of Historical Manuscript Images

Abstract: European libraries and archives are filled with enciphered manuscripts from the early modern period. These include military and diplomatic correspondence, records of secret societies, private letters, and so on. Although they are enciphered with classical cryptographic algorithms, their contents are unavailable to working historians. We therefore attack the problem of automatically converting cipher manuscript images into plaintext. We develop unsupervised models for character segmentation, character-image clu…
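
The abstract outlines a pipeline of character segmentation, character-image clustering, and decipherment of the resulting cluster sequence. As a rough illustration of that idea only (not the authors' models), the sketch below clusters pre-segmented glyph crops with k-means and decodes the cluster sequence by frequency-rank matching against English letter frequencies; the synthetic glyph data, the choice of 26 clusters, and the frequency decoder are assumptions made for this example.

```python
# Illustrative sketch only, not the paper's implementation. The synthetic glyph
# crops, cluster count, and frequency-rank decoder are simplifying assumptions.
import numpy as np
from sklearn.cluster import KMeans

# Pretend segmentation already produced fixed-size glyph crops
# (random placeholders standing in for 32x32 binarized cipher symbols).
rng = np.random.default_rng(0)
glyphs = rng.random((500, 32 * 32))

# Step 1: cluster glyph images so that, ideally, each cluster corresponds to
# one cipher symbol type.
n_symbols = 26  # assumption: roughly one cluster per plaintext letter
cluster_ids = KMeans(n_clusters=n_symbols, n_init=10, random_state=0).fit_predict(glyphs)

# Step 2: a crude decipherment baseline for a simple substitution cipher --
# map cluster IDs to letters by matching frequency ranks against English.
english_by_freq = "etaoinshrdlcumwfgypbvkjxqz"
counts = np.bincount(cluster_ids, minlength=n_symbols)
rank = np.argsort(-counts)  # most frequent cluster first
key = {int(c): english_by_freq[i] for i, c in enumerate(rank)}

plaintext_guess = "".join(key[int(c)] for c in cluster_ids)
print(plaintext_guess[:60])
```

A full decipherment step would replace the frequency-rank decoder with a language-model-driven search over substitution keys, since frequency matching alone rarely recovers more than a handful of letters correctly.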

Cited by 15 publications (15 citation statements). References 11 publications.
“…Since transcription and decipherment are usually separate, subsequent tasks, errors in transcription are propagated and heavily affect the decryption. We therefore investigate integrating image processing and automatic decryption into one single step (as first proposed by Kevin Knight and presented in a pilot study (Yin et al. 2019)) or into an iterated pipeline with feedback. This joint architecture will hopefully speed up the time-consuming transcription, minimize errors, and create synergy effects, as both image processing and automatic decryption tools rely on statistical language models and clustering of symbols, which could be shared.…”
Section: Discussion (mentioning)
confidence: 99%
“…In our data, it is not known which signs are truly related to one another; thus we refrain from giving the model explicit information about compositionality. Yin et al. (2019) segment and transcribe undeciphered scripts based on visual similarities between glyphs. Although their transcription error rate is high, they still achieve partial decipherments with no human intervention.…”
Section: (below) (mentioning)
confidence: 99%
“…This noise can come from the natural degradation of historical documents, human mistakes during a manual transcription process, or misspelled words by the author, as in the Zodiac-408 cipher. Noise can also come from automatically transcribing historical ciphers using Optical Character Recognition (OCR) techniques (Yin et al., 2019). It is thus crucial to have a robust decipherment model that can still crack ciphers despite the noise.…”
Section: Transcription Noise (mentioning)
confidence: 99%
“…Hauer et al. (2014) test their proposed method on noisy ciphers created by randomly corrupting log₂(N) of the ciphertext characters. However, automatic transcription of historical documents is very challenging and can introduce more types of noise, including the addition and deletion of some characters during character segmentation (Yin et al., 2019). We test our model on three types of random noise: insertion, deletion, and substitution.…”
Section: Transcription Noise (mentioning)
confidence: 99%
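
As a minimal sketch only (not the cited authors' code), the snippet below applies the three noise types named in that excerpt (insertion, deletion, substitution) to a ciphertext string; the helper name, alphabet, and noise rate are assumptions made for the example.

```python
# Hypothetical noise-injection helper: splits the overall corruption rate
# evenly across deletion, substitution, and insertion.
import random

def add_noise(ciphertext: str, rate: float, alphabet: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in ciphertext:
        r = rng.random()
        if r < rate / 3:
            continue                          # deletion: drop this character
        elif r < 2 * rate / 3:
            out.append(rng.choice(alphabet))  # substitution: replace it
        elif r < rate:
            out.append(ch)
            out.append(rng.choice(alphabet))  # insertion: add a spurious character
        else:
            out.append(ch)                    # keep unchanged
    return "".join(out)

noisy = add_noise("WDVMXDOTXSNHM", rate=0.15, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ")
print(noisy)
```

In practice the corruption rate and the relative proportions of the three noise types would be chosen to mimic the transcription errors actually observed on the manuscript images being studied.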