2020
DOI: 10.1007/s10579-020-09489-2

Mapping languages: the Corpus of Global Language Use

Abstract: This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports more local languages with smaller sample sizes than alternative off-the-shelf models. Improved language ide…
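The abstract contrasts the paper's language identification model with off-the-shelf alternatives. Purely as a point of reference (this is not the paper's own model), a minimal sketch of applying one such off-the-shelf model, fastText's pretrained lid.176, to a snippet of web text might look as follows; the model filename and confidence threshold are assumptions for illustration.

```python
# Illustrative sketch only: off-the-shelf LangID with fastText's pretrained
# lid.176 model (downloaded separately), NOT the model proposed in the paper.
import fasttext

model = fasttext.load_model("lid.176.bin")  # assumed local path to the model file

def identify_language(text: str, threshold: float = 0.5):
    """Return (language code, probability), or None for low-confidence predictions."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    return (lang, float(probs[0])) if probs[0] >= threshold else None

print(identify_language("Ceci est une phrase en français trouvée sur le web."))
# -> ('fr', 0.99) or similar, depending on the model's confidence
```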

Cited by 20 publications (18 citation statements) | References 17 publications (22 reference statements)

“…While we expected some accuracy loss due to the domain mismatch between clean training data and noisy web text (Dunn, 2020), even after document-consistency filtering the LangID labels were so noisy that the corpora for the majority of languages in our crawl were unusable for any practical NLP task. Table 2 presents some representative samples of noise.…”
Section: Failure Modes of LangID Models on Web Text
confidence: 99%
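The quotation above refers to document-consistency filtering. One common form of such filtering (a sketch under assumptions; the citing paper's exact procedure may differ, and `predict` stands in for any LangID model) keeps only sentences whose sentence-level label agrees with the document's majority label:

```python
from collections import Counter
from typing import Callable

def filter_document(sentences: list[str],
                    predict: Callable[[str], str]) -> list[str]:
    """Keep only sentences whose LangID label matches the document-level
    majority label; `predict` is any sentence -> language-code function."""
    labels = [predict(s) for s in sentences]
    doc_label, _ = Counter(labels).most_common(1)[0]
    return [s for s, label in zip(sentences, labels) if label == doc_label]

# Usage: clean = filter_document(doc_sentences, predict=my_langid_function)
```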
“…Naturally, LangID systems have been applied to web crawls before: Buck et al (2014) published n-gram language models for 175 languages based on Common Crawl data. The Corpora Collection at Leipzig University (Goldhahn et al, 2012) and the Corpus of Global Language Use (Dunn, 2020) offer corpora in 252 and 148 languages. The largest language coverage is probably An Crúbadán, which does not leverage LangID, and found (small amounts of) web data in about 2,000 languages (Scannell, 2007).…”
Section: Previous Implementations
confidence: 99%
“…A similar framework of LanguageCrawl [69] filtered out language-specific webpages from CCC using CLD2 to develop Word2Vec language models of various languages. In addition, Parvaz and Megerdoomian [51] developed multilingual corpora of 148 languages and 423 billion tokens from CCC. Despite the need for high compute and storage, few efforts have been made to crawl the World-Wide-Web (WWW) to develop text corpora of low-resource languages.…”
Section: Related Work
confidence: 99%
“…This dataset is summarized in Table 1. The corpus contains the same amount of data per register per language (Dunn, 2020;Dunn and Adams, 2020).…”
Section: Experimental Design
confidence: 99%