Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages
DOI: 10.3115/v1/w14-2211
SeedLing: Building and Using a Seed Corpus for the Human Language Project

Abstract: A broad-coverage corpus such as the Human Language Project envisioned by Abney and Bird (2010) would be a powerful resource for the study of endangered languages. Existing corpora are limited in the range of languages covered, in standardisation, or in machine-readability. In this paper we present SeedLing, a seed corpus for the Human Language Project. We first survey existing efforts to compile cross-linguistic resources, then describe our own approach. To build the foundation text for a Universal Corpus, we …
Cited by 10 publications (5 citation statements)
References 10 publications
“…We now directly evaluate the three methods described above by applying them to a set of ciphertexts from different languages. We adapted the dataset created by Emerson et al. (2014) from the text of the Universal Declaration of Human Rights (UDHR) in 380 languages. The average length of the texts is 1710 words and 11073 characters.…”
Section: Discussion
confidence: 99%
“…In this paper, we have described our submission to the Diachronic Text Evaluation for SemEval-2015. The cleaning tool used is a compilation of web cleaning scripts (Emerson et al., 2014; Tan et al., 2014b; Tan and Bond, 2011). We have adapted a web crawler to search for the source of the text snippets used for the evaluation and achieved the highest precision score. Additionally, we have crawled and cleaned the source articles of the snippets and produced the Daikon corpus that can be used for future research in diachronic/temporal analysis and epoch identification.…”
Section: Discussion
confidence: 99%
“…We chose Dangerous Connections, an English translation of an epistolary novel, for deriving character-level language models; and a much larger New York Times Corpus for deriving word-level language models. For our language identification experiments, we use a dataset constructed from 380 translations of the Universal Declaration of Human Rights (UDHR) (Emerson et al., 2014), and the multilingual OpenSubtitles corpus of movie subtitles (Lison and Tiedemann, 2016).…”
Section: Music Decipherment
confidence: 99%