2020
DOI: 10.1007/s10579-020-09487-4

Comparing web-crawled and traditional corpora

Cited by 7 publications
(7 citation statements)
References 13 publications
“…Previous work has shown that register is a core property of a number of individual languages when viewed in isolation. We know that languages for which we have abundant corpora contain a large number of distinct varieties defined by their context of production: for instance, Czech (Cvrček et al 2020), English (Egbert et al 2015), and Portuguese (Sardinha et al 2014). We also know that less-studied languages like Somali also contain distinct registers, thus showing the impact of register variation (Biber, 1995).…”
Section: Discussion (mentioning)
confidence: 99%
“…This present paper approaches web data as a macro-register for the purpose of contextualizing the main registers of interest. Another recent line of work leverages the multi-dimensional approach to analyze and compare two Czech corpora: a carefully designed corpus and an opportunistic web-crawled corpus (Cvrček et al 2020). The results show that traditional corpora provide a wider range of registers than web-crawled corpora, a somewhat different finding from other work on the complex registers found in web data (Egbert et al 2015).…”
Section: Related Work (mentioning)
confidence: 99%
“…Václav et al compared two corpora of Czech. One was a traditional corpus and the other was a Web-crawled corpus, which had been extensively compared and analyzed for quality [19].…”
Section: Guokun Et Al Automatically Built a Corpus By Crawling Langua...mentioning
confidence: 99%
“…Corpus linguistics [3] Japanese-Chinese bilingual corpora [1], TED talks, [4,5] Web-crawled corpora [7][8][9][10][11]13,14,16,17,19,24] Other corpora [6,12,18,[20][21][22] Corpus augmentation [15,23,[25][26][27] The above related research showed that corpora play an important role in improving translation accuracy and in other directions of language processing. Thus, the construction of a Japanese-Chinese bilingual corpus for NMT has significant implications for the resource scarcity problem.…”
Section: Classification Related Workmentioning
confidence: 99%
“…Recent work has shown that the impact of register variation exceeds the impact of geographic variation in many cases (Dunn, 2021). The result of register variation is that large corpora often contain a number of distinct sub-corpora, each with their own unique patterns of usage (Sardinha, 2018; Cvrček et al, 2020). In other words, a gigaword web-crawled corpus is not simply a flat collection of many written documents: there is, instead, a register-based grouping of sub-corpora which often contain significantly different linguistic forms.…”
Section: Exposure and Convergencementioning
confidence: 99%