Can Sequence-to-Sequence Models Crack Substitution Ciphers?

Aldarrab, Nada; May, Jonathan

doi:10.18653/v1/2021.acl-long.561

Cited by 1 publication

(8 citation statements)

References 15 publications


“…Seq2Seq: Following (Aldarrab and May, 2021), this is a character-level Transformer architecture that is only optimized on the target-side (plaintext) loss: Target-Only CausalLM is only optimized on the target-side loss L T GT, and incurs no loss when generating the source text.…”

Section: Modelling Symbol Recurrence Relations (mentioning)

confidence: 99%

“…For simple substitution ciphers, we use the same English data as above to create 1.2M synthetic substitution ciphers with lengths up to 256. Following previous work on 1:1 ciphers (Nuhn et al, 2013; Aldarrab and May, 2021), we evaluate on 50 test ciphers of lengths up to 128 (16, 32, 64) and beyond 128 (128, 256) from the Wikipedia page on History. All our experimental settings include data with word boundaries denoted by the space symbol (_).…”

Section: Modelling Symbol Recurrence Relations (mentioning)

confidence: 99%
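As a rough illustration of the synthetic data described in the statement above, the following sketch enciphers text with a random 1:1 key and marks word boundaries with `_`. The function names, the lowercase-only alphabet, and the choice to drop other characters are assumptions made for this example, not details taken from the cited work.

```python
import random
import string


def make_substitution_cipher(plaintext, seed=None):
    """Encipher text with a random 1:1 key (hypothetical helper).

    Assumptions: only a-z are enciphered, spaces become '_',
    and every other character is dropped.
    """
    rng = random.Random(seed)
    alphabet = list(string.ascii_lowercase)
    shuffled = alphabet[:]
    rng.shuffle(shuffled)
    key = dict(zip(alphabet, shuffled))  # plaintext letter -> cipher letter

    cipher = []
    for ch in plaintext.lower():
        if ch == " ":
            cipher.append("_")
        elif ch in key:
            cipher.append(key[ch])
    return "".join(cipher), key


def decipher(ciphertext, key):
    """Invert the key to recover the plaintext."""
    inverse = {c: p for p, c in key.items()}
    return "".join(" " if ch == "_" else inverse[ch] for ch in ciphertext)


cipher, key = make_substitution_cipher("hello world", seed=0)
recovered = decipher(cipher, key)
```

Because the key is a permutation of the alphabet, every ciphertext has the same length as its plaintext and can always be inverted exactly.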

“…Evaluation Following prior work (Aldarrab and May, 2021), we evaluate on Symbol Error Rate (SER), the proportion of ciphertext symbols which are wrongly recovered.…”

Section: Model Details (mentioning)

confidence: 99%
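The Symbol Error Rate quoted above can be sketched in a few lines. This is a minimal version that assumes the predicted and reference plaintexts are aligned symbol by symbol, which holds for substitution ciphers; the function name is a placeholder.

```python
def symbol_error_rate(predicted, reference):
    """Fraction of positions where the predicted symbol is wrong.

    Assumes both sequences have the same, non-zero length.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)


# Two of the eleven symbols are wrong in this toy prediction.
ser = symbol_error_rate("hallo_wurld", "hello_world")
```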

“…CipherGAN (Gomez et al, 2018) exploits learned letter embedding distributions, but requires a large volume of ciphertext and only handles 1:1 substitution and Vigenère ciphers. Luo et al (2021) and Aldarrab and May (2022) propose techniques to decipher undersegmented ciphers. Aldarrab and May (2021) train a sequence-to-sequence neural translation model to decipher from character frequencies.…”

Section: Other Related Work (mentioning)

confidence: 99%

“…Automated computational decipherment of such texts is challenging (Pettersson and Megyesi, 2019; Megyesi et al, 2020). Prior work has mainly focused on using clever heuristics and/or search algorithms to explore the space of cipher keys and score multiple candidate plaintexts under character language models (LMs) (Knight et al, 2006; Corlett and Penn, 2010; Hauer et al, 2014; Berg-Kirkpatrick and Klein, 2013; Nuhn et al, 2013, 2014). In contrast Aldarrab and May (2021) train a sequence-to-sequence model to solve simple (one-to-one) substitution ciphers. This approach, however, cannot solve complex homophonic ciphers as it relies on frequency information which such ciphers obscure.…”

Section: Introduction (mentioning)

confidence: 99%
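The point about homophonic ciphers obscuring frequency information can be seen in a toy example. The sketch below is illustrative only: the key, the integer cipher symbols, and the round-robin choice among homophones are assumptions (real homophonic ciphers pick among homophones arbitrarily).

```python
from collections import Counter
from itertools import cycle


def homophonic_encipher(plaintext, homophones):
    """Replace each letter with one of its cipher symbols, in rotation."""
    cyclers = {ch: cycle(symbols) for ch, symbols in homophones.items()}
    return [next(cyclers[ch]) for ch in plaintext]


def sorted_counts(symbols):
    """Symbol counts from most to least frequent."""
    return sorted(Counter(symbols).values(), reverse=True)


plaintext = "eeeeeeeettttaa"
# Hypothetical key: more frequent letters get more cipher symbols.
homophones = {"e": [1, 2, 3, 4], "t": [5, 6], "a": [7]}
ciphertext = homophonic_encipher(plaintext, homophones)

plain_profile = sorted_counts(plaintext)    # [8, 4, 2]
cipher_profile = sorted_counts(ciphertext)  # [2, 2, 2, 2, 2, 2, 2]
```

A 1:1 substitution would leave the profile `[8, 4, 2]` unchanged under relabelling, which is the signal a frequency-based model relies on; here the homophonic key flattens it completely.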


Decipherment as Regression: Solving Historical Substitution Ciphers by Learning Symbol Recurrence Relations

Kambhatla, Born, Sarkar

2023

Findings of the Association for Computational Linguistics: EACL 2023


Solving substitution ciphers involves mapping sequences of cipher symbols to fluent text in a target language. This has conventionally been formulated as a search problem, to find the decipherment key using a character-level language model to constrain the search space. This work instead frames decipherment as a sequence prediction task, using a Transformer-based causal language model to learn recurrences between characters in a ciphertext. We introduce a novel technique for transcribing arbitrary substitution ciphers into a common recurrence encoding. By leveraging this technique, we (i) create a large synthetic dataset of homophonic ciphers using random keys, and (ii) train a decipherment model that predicts the plaintext sequence given a recurrence-encoded ciphertext. Our method achieves strong results on synthetic 1:1 and homophonic ciphers, and cracks several real historic homophonic ciphers. Our analysis shows that the model learns recurrence relations between cipher symbols and recovers decipherment keys in its self-attention.
