Proceedings of the Workshop on Comparing Corpora - 2000
DOI: 10.3115/1117729.1117730
Comparing corpora using frequency profiling

Abstract: This paper describes a method of comparing corpora which uses frequency profiling. The method can be used to discover key words in the corpora which differentiate one corpus from another. Using annotated corpora, it can be applied to discover key grammatical or word-sense categories. This can be used as a quick way in to find the differences between the corpora and is shown to have applications in the study of social differentiation in the use of English vocabulary, profiling of learner English and document an…
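The keyword-discovery procedure the abstract describes can be sketched as follows. This is a minimal illustration using the paper's log-likelihood statistic; the toy corpora, the `key_words` helper, and its parameters are invented for this example, not taken from the paper:

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness (Rayson & Garside 2000).

    a, b: frequency of a word in Corpus 1 and Corpus 2
    c, d: total length in words of Corpus 1 and Corpus 2
    """
    # Expected frequencies under the null of equal relative frequency
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def key_words(target_tokens, reference_tokens, top_n=5):
    """Rank words by how strongly their frequency profiles differ."""
    tf, rf = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scores = {w: log_likelihood(tf[w], rf[w], c, d)
              for w in set(tf) | set(rf)}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy corpora: "court" is overused in the target relative to the reference.
target = "the court ruled the court found the defendant guilty".split()
reference = "the weather report said the day would be sunny and warm".split()
print(key_words(target, reference, top_n=3))  # "court" ranks first
```

Words with the largest log-likelihood score are the "key words" of the target corpus; the same procedure applies unchanged to tag or word-sense frequencies in annotated corpora.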

Cited by 305 publications (220 citation statements). References 12 publications.
“…We consider positive key semantic tags, or those 'overused' in the target ICTY Trials and Appeals corpus, as opposed to negative ones, which are 'underused' in comparison to a reference corpus. This is measured using the log likelihood procedure [42], which demonstrates confidence of significance.…”
Section: Methods and Tools
confidence: 99%
“…Secondly, we use a log likelihood model as given in Eq. 4 (Rayson et al 2000). This algorithm compares two corpora, in our case a specific piece of text and the background collection, and ranks highly the words which have the most significant relative frequency difference between the two corpora.…”
Section: Varying the Term Selection Algorithm
confidence: 99%
“…Latent semantic analysis and latent Dirichlet allocation outperform a baseline of TF-IDF on an automated foldering and a recipient prediction task. Rayson et al (2000) propose a method to compare different corpora using frequency profiling, which could also be used to generate terms for word clouds. Their goal is to discover keywords that differentiate one corpus from another.…”
Section: Related Work
confidence: 99%
“…We followed Rayson and Garside's (2000) formula to calculate this log-likelihood: Given the frequency a of a word in Corpus 1 (i.e., DWDD), its frequency b in Corpus 2 (i.e., Pauw), the total length in words of Corpus 1 c, and the total length in words of Corpus 2 d, the expected frequency of the word in Corpus 1 can be calculated as E1 = c(a+b)/(c+d) and its expected frequency in Corpus 2 as E2 = d(a+b)/(c+d).…”
Section: Discussion
confidence: 99%
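The expected-frequency formula quoted above can be checked numerically; the counts a, b, c, d below are made-up illustrative values, not figures from the cited study:

```python
import math

# Illustrative counts (not taken from the cited study):
a, c = 50, 100_000   # word frequency and corpus size, Corpus 1
b, d = 25, 200_000   # word frequency and corpus size, Corpus 2

# Expected frequencies under the null hypothesis of equal relative frequency
e1 = c * (a + b) / (c + d)
e2 = d * (a + b) / (c + d)

# Log-likelihood statistic: 2 * (a*ln(a/E1) + b*ln(b/E2))
ll = 2 * (a * math.log(a / e1) + b * math.log(b / e2))
print(e1, e2, round(ll, 2))  # 25.0 50.0 34.66
```

Here the word is four times as frequent (relatively) in Corpus 1 as in Corpus 2, and the resulting statistic far exceeds the 3.84 critical value for p < 0.05 on one degree of freedom, so the difference would be flagged as significant.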