Comparing web-crawled and traditional corpora

Cvrček, Václav; Komrsková, Zuzana; Lukeš, David; Poukarová, Petra; Řehořková, Anna; Zasina, Adrian Jan; Benko, Vladimír

doi:10.1007/s10579-020-09487-4

Cited by 7 publications

(7 citation statements)

References 13 publications

“…Previous work has shown that register is a core property of a number of individual languages when viewed in isolation. We know that languages for which we have abundant corpora contain a large number of distinct varieties defined by their context of production: for instance, Czech (Cvrček et al 2020), English (Egbert et al 2015), and Portuguese (Sardinha et al 2014). We also know that less-studied languages like Somali also contain distinct registers, thus showing the impact of register variation (Biber, 1995).…”

Section: Discussion (mentioning)

confidence: 99%

“…This present paper approaches web data as a macro-register for the purpose of contextualizing the main registers of interest. Another recent line of work leverages the multi-dimensional approach to analyze and compare two Czech corpora: a carefully designed corpus and an opportunistic web-crawled corpus (Cvrček et al 2020). The results show that traditional corpora provide a wider range of registers than web-crawled corpora, a somewhat different finding from other work on the complex registers found in web data (Egbert et al 2015).…”

Section: Related Work (mentioning)

confidence: 99%

Register variation remains stable across 60 languages

Dunn

Nini

2022

Corpus Linguistics and Linguistic Theory

This paper measures the stability of cross-linguistic register variation. A register is a variety of a language that is associated with extra-linguistic context. The relationship between a register and its context is functional: the linguistic features that make up a register are motivated by the needs and constraints of the communicative situation. This view hypothesizes that register should be universal, so that we expect a stable relationship between the extra-linguistic context that defines a register and the sets of linguistic features which the register contains. In this paper, the universality and robustness of register variation is tested by comparing variation within versus between register-specific corpora in 60 languages using corpora produced in comparable communicative situations: tweets and Wikipedia articles. Our findings confirm the prediction that register variation is, in fact, universal.

“…Václav et al compared two corpora of Czech. One was a traditional corpus and the other was a Web-crawled corpus, which had been extensively compared and analyzed for quality [19].…”

Section: Guokun Et Al Automatically Built a Corpus By Crawling Langua... (mentioning)

confidence: 99%

“…Corpus linguistics [3]; Japanese-Chinese bilingual corpora [1]; TED talks [4,5]; Web-crawled corpora [7,8,9,10,11,13,14,16,17,19,24]; Other corpora [6,12,18,20,21,22]; Corpus augmentation [15,23,25,26,27]. The above related research showed that corpora play an important role in improving translation accuracy and in other directions of language processing. Thus, the construction of a Japanese-Chinese bilingual corpus for NMT has significant implications for the resource scarcity problem.…”

Section: Classification Related Work (mentioning)

confidence: 99%

WCC-JC: A Web-Crawled Corpus for Japanese-Chinese Neural Machine Translation

Zhang

Tian

Mao

et al. 2022

Applied Sciences

Currently, there are only a limited number of Japanese–Chinese bilingual corpora of a sufficient amount that can be used as training data for neural machine translation (NMT). In particular, there are few corpora that include spoken language such as daily conversation. In this research, we attempt to construct a Japanese–Chinese bilingual corpus of a certain scale by crawling the subtitle data of movies and TV series from the websites. We calculated the BLEU scores of the constructed WCC-JC (Web Crawled Corpus—Japanese and Chinese) and the other compared corpora. We also manually evaluated the translation results using the translation model trained on the WCC-JC to confirm the quality and effectiveness.

“…Recent work has shown that the impact of register variation exceeds the impact of geographic variation in many cases (Dunn, 2021). The result of register variation is that large corpora often contain a number of distinct sub-corpora, each with their own unique patterns of usage (Sardinha, 2018; Cvrček et al, 2020). In other words, a gigaword web-crawled corpus is not simply a flat collection of many written documents: there is, instead, a register-based grouping of sub-corpora which often contain significantly different linguistic forms.…”

Section: Exposure and Convergence (mentioning)

confidence: 99%

Learned Construction Grammars Converge Across Registers Given Increased Exposure

Dunn

Madabushi

2021

Proceedings of the 25th Conference on Computational Natural Language Learning

This paper measures the impact of increased exposure on whether learned construction grammars converge onto shared representations when trained on data from different registers. Register influences the frequency of constructions, with some structures common in formal but not informal usage. We expect that a grammar induction algorithm exposed to different registers will acquire different constructions. To what degree does increased exposure lead to the convergence of register-specific grammars? The experiments in this paper simulate language learning in 12 languages (half Germanic and half Romance) with corpora representing three registers (Twitter, Wikipedia, Web). These simulations are repeated with increasing amounts of exposure, from 100k to 2 million words, to measure the impact of exposure on the convergence of grammars. The results show that increased exposure does lead to converging grammars across all languages. In addition, a shared core of register-universal constructions remains constant across increasing amounts of exposure.
