2001
DOI: 10.1075/ijcl.6.1.05kil

Comparing Corpora

Abstract: Corpus linguistics lacks strategies for describing and comparing corpora. Currently most descriptions of corpora are textual, and questions such as ‘what sort of a corpus is this?’, or ‘how does this corpus compare to that?’ can only be answered impressionistically. This paper considers various ways in which different corpora can be compared more objectively. First we address the issue, ‘which words are particularly characteristic of a corpus?’, reviewing and critiquing the statistical methods which have been …

Cited by 318 publications (217 citation statements)
“…Work in corpus linguistics [11] compares text similarity metrics on a corpus scale. This work introduces a χ²-test based model of corpus similarity and compares it with the probabilistic similarity measures perplexity [5] and mutual information [4].…”
Section: Related Work (mentioning)
confidence: 99%
“…This is important if the text classification task involves authorship attribution for forensic purposes [2]. The similarity metric used is the chi-by-degrees-of-freedom statistic suggested for the calculation of corpus homogeneity in the past by Kilgarriff, using word-level tokenization [3]. This essentially means calculating the χ² statistic for each token in the pair of files under consideration, and averaging that over the total number of tokens considered.…”
Section: Background and Methods (mentioning)
confidence: 99%
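
The averaging described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the cited authors' implementation. The function name `chi_by_token_average`, the `top_n` cutoff, and the choice to divide by the number of word types considered (rather than by a separately computed degrees-of-freedom value) are all assumptions made for the example. Expected counts assume each word is shared between the two files in proportion to file size.

```python
from collections import Counter


def chi_by_token_average(tokens_a, tokens_b, top_n=500):
    """Average per-word chi-square contribution between two token lists.

    Hypothetical helper: the name, the top_n cutoff and the normalisation
    by the number of word types are illustrative assumptions.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both token lists must be non-empty")

    # Restrict the comparison to the most frequent words of the joint sample.
    joint = counts_a + counts_b
    words = [word for word, _ in joint.most_common(top_n)]

    total = 0.0
    for word in words:
        observed_a, observed_b = counts_a[word], counts_b[word]
        combined = observed_a + observed_b
        # Expected counts if the word were spread in proportion to file size.
        expected_a = size_a * combined / (size_a + size_b)
        expected_b = size_b * combined / (size_a + size_b)
        total += (observed_a - expected_a) ** 2 / expected_a
        total += (observed_b - expected_b) ** 2 / expected_b

    # Average over the number of word types that entered the sum.
    return total / len(words)


if __name__ == "__main__":
    same = "the cat sat on the mat".split()
    print(chi_by_token_average(same, same))
    print(chi_by_token_average(["x", "x"], ["y", "y"]))
```

Under these assumptions, lower values indicate more similar files, and two identical files score zero.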
“…Given these parameters (five categories, ten files each), six out of ten files must be assigned to a category on the basis of similarity for it to be deemed significantly homogeneous. Only the categories associated with Cheney (9.473), Lieberman (7.357) and the Moderator (7.723) are significantly homogeneous. The confusion matrix associated with the assignment of files that did not fit into their a priori category can be summarized as follows: the Cheney and Lieberman categories attract the files associated with each of the other categories (the Moderator is nearly equivalent in homogeneity to Lieberman, but is not an attractor at all).…”
Section: Experiments 1-speakers Define Categories (mentioning)
confidence: 99%
“…Next follows a very useful discussion on how to compare a web corpus with other corpora, either compiled from the web or created using traditional sampling methods. In this discussion, the authors draw on Kilgarriff (2001) and argue that rather than using hypothesis testing for the purpose of trying to determine if two corpora correspond to samples from the same population, it is more informative to use the test statistic (for example, χ²) to assess the relative similarity between corpora, much like the use of collocation measures. A related way of characterising a corpus is to use lists of keywords, for example, ratios of relative frequencies of words.…”
mentioning
confidence: 99%
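
The keyword idea mentioned in the last statement, ranking words by the ratio of their relative frequencies in two corpora, might look something like the sketch below. The function name `keyword_ratios`, the per-million scaling, and the additive `smoothing` constant (used here only to avoid division by zero for words missing from the reference corpus) are assumptions for illustration and are not taken from the quoted works.

```python
from collections import Counter


def keyword_ratios(focus_tokens, reference_tokens, smoothing=1.0, per=1_000_000):
    """Rank focus-corpus words by a smoothed ratio of relative frequencies.

    Hypothetical helper: the smoothing constant and per-million scaling
    are illustrative assumptions.
    """
    n_focus, n_reference = len(focus_tokens), len(reference_tokens)
    if n_focus == 0 or n_reference == 0:
        raise ValueError("both corpora must be non-empty")

    focus = Counter(focus_tokens)
    reference = Counter(reference_tokens)

    scores = {}
    for word, count in focus.items():
        # Relative frequencies, scaled to occurrences per `per` tokens.
        rel_focus = count * per / n_focus
        rel_reference = reference[word] * per / n_reference
        scores[word] = (rel_focus + smoothing) / (rel_reference + smoothing)

    # Highest ratios first: words most characteristic of the focus corpus.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    focus = ["corpus", "corpus", "the", "the"]
    reference = ["the", "the", "the", "cat"]
    for word, score in keyword_ratios(focus, reference):
        print(word, round(score, 3))
```

A ratio above 1 marks a word as relatively more frequent in the focus corpus. The ranking is sensitive to the smoothing constant, so that value is a design choice rather than a fixed part of the approach.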