Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages
DOI: 10.3115/v1/w14-2211
SeedLing: Building and Using a Seed Corpus for the Human Language Project

Abstract: A broad-coverage corpus such as the Human Language Project envisioned by Abney and Bird (2010) would be a powerful resource for the study of endangered languages. Existing corpora are limited in the range of languages covered, in standardisation, or in machine-readability. In this paper we present SeedLing, a seed corpus for the Human Language Project. We first survey existing efforts to compile cross-linguistic resources, then describe our own approach. To build the foundation text for a Universal Corpus, we …
Cited by 10 publications (5 citation statements)
References 10 publications
“…We now directly evaluate the three methods described above by applying them to a set of ciphertexts from different languages. We adapted the dataset created by Emerson et al. (2014) from the text of the Universal Declaration of Human Rights (UDHR) in 380 languages. The average length of the texts is 1710 words and 11073 characters.…”
Section: Discussion
confidence: 99%
“…In this paper, we have described our submission to the Diachronic Text Evaluation for SemEval-2015. The cleaning tool used is a compilation of web cleaning scripts (Emerson et al., 2014; Tan et al., 2014b; Tan and Bond, 2011). We have adapted a web crawler to search for the source of the text snippets used for the evaluation and achieved the highest precision score. Additionally, we have crawled and cleaned the source articles of the snippets and produced the Daikon corpus that can be used for future research in diachronic/temporal analysis and epoch identification.…”
Section: Discussion
confidence: 99%
“…We chose Dangerous Connections, an English translation of an epistolary novel, for deriving character-level language models; and a much larger New York Times Corpus for deriving word-level language models. For our language identification experiments, we use a dataset constructed from 380 translations of the Universal Declaration of Human Rights (UDHR) (Emerson et al., 2014), and the multilingual OpenSubtitles corpus of movie subtitles (Lison and Tiedemann, 2016).…”
Section: Music Decipherment
confidence: 99%