2021
DOI: 10.1111/cogs.12983

The Challenges of Large‐Scale, Web‐Based Language Datasets: Word Length and Predictability Revisited

Abstract: Language research has come to rely heavily on large-scale, web-based datasets. These datasets can present significant methodological challenges, requiring researchers to make a number of decisions about how they are collected, represented, and analyzed. These decisions often concern long-standing challenges in corpus-based language research, including determining what counts as a word, deciding which words should be analyzed, and matching sets of words across languages. We illustrate these challenges by revisi…

Cited by 13 publications (44 citation statements)
References 55 publications
“…Each corpus contained 10 million sentences collected from diverse online sources and randomly shuffled by the corpus creators. The UTF-8 encoding was used, in accordance with [5], to make sure that strings like Spanish si 'if' and sí 'yes' are treated as different words.…”
Section: Methods (mentioning)
confidence: 99%
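
The point about Spanish si and sí can be illustrated with a minimal Python sketch. It is not the citing author's code: the helper name `count_words` is hypothetical, and the NFC normalization step is an added assumption (the statement mentions only UTF-8), included so that a precomposed í and an i followed by a combining accent are counted as the same word.

```python
import unicodedata
from collections import Counter


def count_words(tokens):
    """Count tokens, keeping accented and unaccented forms apart.

    Assumption: tokens are normalized to NFC first, so that the two
    Unicode spellings of an accented letter map to one string.
    """
    return Counter(unicodedata.normalize("NFC", t) for t in tokens)


# "s\u00ed" is the precomposed form of sí; "si\u0301" is s + i + combining acute.
tokens = ["si", "s\u00ed", "si", "si\u0301"]
counts = count_words(tokens)

# The unaccented and accented words also differ at the byte level in UTF-8.
si_bytes = "si".encode("utf-8")
si_accented_bytes = "s\u00ed".encode("utf-8")
```

Without the normalization step, the last token would be counted as a third, separate word even though it renders identically to sí.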
“…Uppercase characters were converted to lowercase. After that, the unigrams were cleaned: following the filtering methods in [5], I retained only those strings that occurred in the corresponding OpenSubtitles corpus 1 [25] and in the Hunspell dictionary (with the exception of Finnish, for which the Microsoft spellchecker was used). The removed strings represented words written in a different language (most importantly, English), misspellings, proper names, acronyms and punctuation marks.…”
Section: Methods (mentioning)
confidence: 99%
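
The cleaning step quoted above (lowercasing, then keeping only strings attested in both a subtitle corpus and a spelling dictionary) might be sketched as follows. This is an illustration under stated assumptions, not the citing author's implementation: the function name `clean_unigrams`, the use of plain Python sets for the two vocabularies, and the choice to sum frequencies of forms that merge after lowercasing are all hypothetical.

```python
def clean_unigrams(unigrams, subtitle_vocab, dictionary_vocab):
    """Lowercase unigrams and keep those found in both reference vocabularies.

    unigrams: dict mapping a string to its corpus frequency.
    subtitle_vocab, dictionary_vocab: sets of lowercase word forms
    (stand-ins for an OpenSubtitles word list and a Hunspell dictionary).
    """
    allowed = subtitle_vocab & dictionary_vocab  # must appear in both sources
    cleaned = {}
    for word, freq in unigrams.items():
        lowered = word.lower()
        if lowered in allowed:
            # Assumption: case variants are merged by summing their counts.
            cleaned[lowered] = cleaned.get(lowered, 0) + freq
    return cleaned


# Toy data only; real vocabularies would be loaded from the resources named above.
unigrams = {"Casa": 3, "casa": 7, "the": 5, "NASA": 2, "perro": 4, ",": 9}
subtitle_vocab = {"casa", "perro", "the", "nasa"}
dictionary_vocab = {"casa", "perro", "gato"}
cleaned = clean_unigrams(unigrams, subtitle_vocab, dictionary_vocab)
```

In this toy example the foreign word, the acronym, and the punctuation mark are dropped because each is missing from at least one of the two vocabularies.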