2009
DOI: 10.1007/s10579-009-9081-4
The WaCky wide web: a collection of very large linguistically processed web-crawled corpora

Abstract: This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages…

Cited by 751 publications (578 citation statements)
References 14 publications
“…In fact, the encyclopedic nature of Wikipedia has been exploited in a wide variety of works (Ponzetto and Strube, 2007;Flati et al, 2016;Gupta et al, 2016), and differs substantially from the web-based corpus we put forward here. As source corpus for the Italian subtask (1B) we instead used the 1.3-billion-word itWac corpus 7 (Baroni et al, 2009), extracted from different sources of the web within the .it domain. Finally, as source corpus for the Spanish subtask (1C) we considered the 1.8-billion-word Spanish corpus 8 (Cardellino, 2016), which also contains heterogeneous documents from different sources.…”
Section: Corpus Compilation (mentioning)
confidence: 99%
“…In both tests, we used as training corpora the TASA corpus (Zeno, Ivens, Millard, & Duvvuri, 1995) and a random subsample of ukWaC corpus (Baroni, Bernardini, Ferraresi, & Zanchetta, 2009). The TASA corpus is a commonly used linguistic corpus consisting of 37k educational texts with a corpus size of 5M words in its cleaned form.…”
Section: Corpora (mentioning)
confidence: 99%
“…We want to thank the teams behind the TASA (Zeno et al, 1995), WaCky (Baroni et al, 2009) and Dreambank (Domhoff & Schneider, 2008b) projects for providing us the corpora. This research was supported by Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Universidad de Buenos Aires, and Agencia Nacional de Promoción Científica y Tecnológica.…”
Section: Acknowledgments (unclassified)
“…Multiple pipelines for building web corpora were described in many papers in the last decade (Baroni et al, 2009;Ljubešić and Erjavec, 2011;Schäfer and Bildhauer, 2012), but, to the best of our knowledge, only one pipeline is freely available as a complete, ready-to-use tool: the Brno pipeline (Suchomel and Pomikálek, 2012), consisting of the SpiderLing crawler 5 , the Chared encoding detector 6 , the jusText content extractor 7 and the Onion near-deduplicator 8 . Although we have our own pipeline set up (this is the pipeline the first versions of hrWaC and slWaC were built with), we decided to compile these versions of web corpora with the Brno pipeline for two reasons: 1. to inspect the pipeline's capabilities, and 2. to extend the Croatian web corpus as much as possible by using a different crawler.…”
Section: Related Work (mentioning)
confidence: 99%
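The Brno pipeline described above ends with Onion, a near-deduplicator that drops documents sharing a large fraction of word n-grams with text already kept. As a rough illustration of that idea only (not Onion's actual algorithm or defaults), here is a minimal stdlib-only sketch using word 5-gram "shingles" and an assumed duplication threshold of 0.5:

```python
def shingles(text, n=5):
    # Split the document into word n-grams ("shingles");
    # n=5 is an illustrative choice, not Onion's default.
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_dedup(docs, threshold=0.5, n=5):
    # Keep a document only if fewer than `threshold` of its shingles
    # have already appeared in previously kept documents.
    seen = set()
    kept = []
    for doc in docs:
        sh = shingles(doc, n)
        if not sh:
            continue  # too short to compare
        dup_ratio = len(sh & seen) / len(sh)
        if dup_ratio < threshold:
            kept.append(doc)
            seen |= sh
    return kept
```

A document identical (or nearly identical) to one already in the collection has a duplication ratio close to 1.0 and is discarded, while a fresh document passes through; real tools like Onion do this at scale with hashed n-grams rather than raw strings.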