Proceedings of the Workshop on Comparing Corpora - 2000
DOI: 10.3115/1117729.1117730
Comparing corpora using frequency profiling

Abstract: This paper describes a method of comparing corpora which uses frequency profiling. The method can be used to discover key words in the corpora which differentiate one corpus from another. Using annotated corpora, it can be applied to discover key grammatical or word-sense categories. This can be used as a quick way in to find the differences between the corpora and is shown to have applications in the study of social differentiation in the use of English vocabulary, profiling of learner English and document an…
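The keyword-discovery procedure the abstract describes can be sketched as follows. This is a minimal illustration using the paper's log-likelihood statistic; the toy corpora, the `key_words` helper, and its parameters are invented for this example, not taken from the paper:

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness (Rayson & Garside 2000).

    a, b: frequency of a word in Corpus 1 and Corpus 2
    c, d: total length in words of Corpus 1 and Corpus 2
    """
    # Expected frequencies under the null of equal relative frequency
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def key_words(target_tokens, reference_tokens, top_n=5):
    """Rank words by how strongly their frequency profiles differ."""
    tf, rf = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scores = {w: log_likelihood(tf[w], rf[w], c, d)
              for w in set(tf) | set(rf)}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy corpora: "court" is overused in the target relative to the reference.
target = "the court ruled the court found the defendant guilty".split()
reference = "the weather report said the day would be sunny and warm".split()
print(key_words(target, reference, top_n=3))  # "court" ranks first
```

Words with the largest log-likelihood score are the "key words" of the target corpus; the same procedure applies unchanged to tag or word-sense frequencies in annotated corpora.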

Cited by 305 publications (220 citation statements). References 12 publications.
“…We consider positive key semantic tags, or those 'overused' in the target ICTY Trials and Appeals corpus, as opposed to negative ones, which are 'underused' in comparison to a reference corpus. This is measured using the log likelihood procedure [42], which demonstrates confidence of significance.…”
Section: Methods and Tools
confidence: 99%
“…Secondly, we use a log likelihood model as given in Eq. 4 (Rayson et al 2000). This algorithm compares two corpora, in our case a specific piece of text and the background collection, and ranks highly the words which have the most significant relative frequency difference between the two corpora.…”
Section: Varying the Term Selection Algorithm
confidence: 99%
“…Latent semantic analysis and latent Dirichlet allocation outperform a baseline of TF-IDF on an automated foldering and a recipient prediction task. Rayson et al (2000) propose a method to compare different corpora using frequency profiling, which could also be used to generate terms for word clouds. Their goal is to discover keywords that differentiate one corpus from another.…”
Section: Related Work
confidence: 99%
“…We followed Rayson and Garside's (2000) formula to calculate this log-likelihood: Given the frequency a of a word in Corpus 1 (i.e., DWDD), its frequency b in Corpus 2 (i.e., Pauw), the total length in words of Corpus 1 c, and the total length in words of Corpus 2 d, the expected frequency of the word in Corpus 1 can be calculated as E1 = c(a+b)/(c+d) and its expected frequency in Corpus 2 as E2 = d(a+b)/(c+d).…”
Section: Discussion
confidence: 99%
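The expected-frequency formula quoted above can be checked numerically; the counts a, b, c, d below are made-up illustrative values, not figures from the cited study:

```python
import math

# Illustrative counts (not taken from the cited study):
a, c = 50, 100_000   # word frequency and corpus size, Corpus 1
b, d = 25, 200_000   # word frequency and corpus size, Corpus 2

# Expected frequencies under the null hypothesis of equal relative frequency
e1 = c * (a + b) / (c + d)
e2 = d * (a + b) / (c + d)

# Log-likelihood statistic: 2 * (a*ln(a/E1) + b*ln(b/E2))
ll = 2 * (a * math.log(a / e1) + b * math.log(b / e2))
print(e1, e2, round(ll, 2))  # 25.0 50.0 34.66
```

Here the word is four times as frequent (relatively) in Corpus 1 as in Corpus 2, and the resulting statistic far exceeds the 3.84 critical value for p < 0.05 on one degree of freedom, so the difference would be flagged as significant.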