2009
DOI: 10.1007/s10579-009-9081-4
The WaCky wide web: a collection of very large linguistically processed web-crawled corpora

Abstract: This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages…

Cited by 751 publications (578 citation statements)
References 14 publications
“…In fact, the encyclopedic nature of Wikipedia has been exploited in a wide variety of works (Ponzetto and Strube, 2007;Flati et al, 2016;Gupta et al, 2016), and differs substantially from the web-based corpus we put forward here. As source corpus for the Italian subtask (1B) we instead used the 1.3-billion-word itWac corpus 7 (Baroni et al, 2009), extracted from different sources of the web within the .it domain. Finally, as source corpus for the Spanish subtask (1C) we considered the 1.8-billion-word Spanish corpus 8 (Cardellino, 2016), which also contains heterogeneous documents from different sources.…”
Section: Corpus Compilation (mentioning)
confidence: 99%
“…In both tests, we used as training corpora the TASA corpus (Zeno, Ivens, Millard, & Duvvuri, 1995) and a random subsample of ukWaC corpus (Baroni, Bernardini, Ferraresi, & Zanchetta, 2009). The TASA corpus is a commonly used linguistic corpus consisting of 37k educational texts with a corpus size of 5M words in its cleaned form.…”
Section: Corpora (mentioning)
confidence: 99%
“…We want to thank the teams behind the TASA (Zeno et al, 1995), WaCky (Baroni et al, 2009) and Dreambank (Domhoff & Schneider, 2008b) projects for providing us the corpora. This research was supported by Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Universidad de Buenos Aires, and Agencia Nacional de Promoción Científica y Tecnológica.…”
Section: Acknowledgments (unclassified)
“…Multiple pipelines for building web corpora were described in many papers in the last decade (Baroni et al, 2009;Ljubešić and Erjavec, 2011;Schäfer and Bildhauer, 2012), but, to the best of our knowledge, only one pipeline is freely available as a complete, ready-to-use tool: the Brno pipeline (Suchomel and Pomikálek, 2012), consisting of the SpiderLing crawler 5 , the Chared encoding detector 6 , the jusText content extractor 7 and the Onion near-deduplicator 8 . Although we have our own pipeline set up (this is the pipeline the first versions of hrWaC and slWaC were built with), we decided to compile these versions of web corpora with the Brno pipeline for two reasons: 1. to inspect the pipeline's capabilities, and 2. to extend the Croatian web corpus as much as possible by using a different crawler.…”
Section: Related Work (mentioning)
confidence: 99%
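The Brno pipeline described above ends with Onion, a near-deduplicator that drops documents sharing a large fraction of word n-grams with text already kept. As a rough illustration of that idea only (not Onion's actual algorithm or defaults), here is a minimal stdlib-only sketch using word 5-gram "shingles" and an assumed duplication threshold of 0.5:

```python
def shingles(text, n=5):
    # Split the document into word n-grams ("shingles");
    # n=5 is an illustrative choice, not Onion's default.
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_dedup(docs, threshold=0.5, n=5):
    # Keep a document only if fewer than `threshold` of its shingles
    # have already appeared in previously kept documents.
    seen = set()
    kept = []
    for doc in docs:
        sh = shingles(doc, n)
        if not sh:
            continue  # too short to compare
        dup_ratio = len(sh & seen) / len(sh)
        if dup_ratio < threshold:
            kept.append(doc)
            seen |= sh
    return kept
```

A document identical (or nearly identical) to one already in the collection has a duplication ratio close to 1.0 and is discarded, while a fresh document passes through; real tools like Onion do this at scale with hashed n-grams rather than raw strings.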