2021
DOI: 10.1007/s10579-021-09551-7
LanguageCrawl: a generic tool for building language models upon Common Crawl

Abstract: The exponential growth of the internet community has resulted in the production of a vast amount of unstructured data, including web pages, blogs and social media. Such a volume consisting of hundreds of billions of words is unlikely to be analyzed by humans. In this work we introduce the tool LanguageCrawl, which allows Natural Language Processing (NLP) researchers to easily build web-scale corpora using the Common Crawl Archive—an open repository of web crawl information, which contains petabytes of data. We…


Cited by 9 publications (4 citation statements)
References 17 publications (21 reference statements)
“…This approach resembled the semi-supervised learning (SSL) technique [37,38], in which a large unannotated dataset was assigned labels based on a classifier trained on a much smaller annotated dataset. We took the Polish subset of the Common Crawl archive, called hereinafter pCC , by filtering the whole set with the LanguageCrawl toolkit [39]. It resulted in a few billion web pages with some Polish content.…”
Section: Terabot-Therapeutic Spoken Dialogue System
Citation type: mentioning; confidence: 99%
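The pseudo-labelling step described in the quote above (a classifier trained on a small annotated set assigns labels to a much larger unannotated pool) can be sketched in a few lines. The nearest-centroid classifier and the toy one-dimensional data below are invented purely for illustration; they are not the classifier or data used in the cited work.

```python
# Toy sketch of semi-supervised pseudo-labelling: fit a classifier on a
# small labelled sample, then use it to label a larger unlabelled pool.
# The nearest-centroid model and all values here are hypothetical.

def train_centroids(labelled):
    # labelled: list of (feature_value, label); compute one centroid per label.
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def pseudo_label(centroids, unlabelled):
    # Assign each unlabelled point the label of its nearest centroid.
    return [(x, min(centroids, key=lambda y: abs(x - centroids[y])))
            for x in unlabelled]

labelled = [(0.1, "neg"), (0.2, "neg"), (0.9, "pos"), (1.0, "pos")]
unlabelled = [0.05, 0.3, 0.8, 0.95]  # the "large" unannotated pool

centroids = train_centroids(labelled)
print(pseudo_label(centroids, unlabelled))
# [(0.05, 'neg'), (0.3, 'neg'), (0.8, 'pos'), (0.95, 'pos')]
```

The same pattern scales up: in the cited setting, the "unlabelled pool" is the Polish Common Crawl subset and the classifier is trained on a much smaller annotated corpus.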
“…n-grams are capable of displaying various word clusters ranging in length from 2 words (2-grams) to 4 words (4-grams). n-grams are valuable for various purposes, such as improving the accuracy of speech recognition, spell checking, or machine translation systems (Roziewski & Kozłowski, 2021). Besides presenting a list of word groups, the n-gram table also provides the frequency of occurrence of each word group in the corpus as well as the number of texts containing that word group.…”
Section: The Meaning of Cacat, Difabel and Disabilitas
Citation type: mentioning; confidence: 99%
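The n-gram table described in the quote above (word clusters with their corpus frequency and the number of texts containing them) can be reproduced with the standard library alone. The tiny two-sentence corpus below is invented for illustration; it is not drawn from any corpus mentioned in the cited works.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical mini-corpus, standing in for a web-scale one.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox is quick",
]

counts = Counter()            # total occurrences of each n-gram
texts_containing = Counter()  # number of texts containing each n-gram
for text in corpus:
    grams = ngrams(text.split(), 2)
    counts.update(grams)
    texts_containing.update(set(grams))

print(counts[("the", "quick")])            # 2: corpus frequency
print(texts_containing[("the", "quick")])  # 2: texts containing the 2-gram
```

Changing the `n` argument from 2 to 4 yields the 2-gram through 4-gram tables the quote refers to; `Counter.most_common()` then sorts each table by frequency.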
“…Finally, while larger corpora generally result in better models (Kaplan et al, 2020; Sun et al, 2017), data quality and corpora content also play a major role in the caliber and appropriateness of these models for the various downstream applications (Florez, 2019; Abid et al, 2021; Bhardwaj et al, 2021). To produce high quality and safe neural language models will likely require the community to adopt more mindful data collection practices (Gehman et al, 2020; Bender and Friedman, 2018; Gebru et al, 2018; Jo and Gebru, 2020; Paullada et al, 2020; Bender et al, 2021), establish standardized filtering pipelines for corpora (Roziewski and Stokowiec, 2016; Ortiz Suarez et al, 2019; Wenzek et al, 2020), and develop methods for evaluating the bias in trained models (Schick et al, 2021). We recognize that this is not a straightforward task with a one-size-fits-all solution, but we propose that as much attention should be dedicated to the corpora used for training language models as to the models themselves, and that corpora transparency is a prerequisite for language model accountability.…”
Section: Future Work
Citation type: mentioning; confidence: 99%