Handbook of Linguistic Annotation 2017
DOI: 10.1007/978-94-024-0881-2_11
Inter-annotator Agreement

Cited by 96 publications (66 citation statements). References 23 publications.
“…Point 2: Reporting IAA studies With regard to point 2, much has been said in previous works. Because presenting a detailed report of those works is beyond the scope of this paper, we refer to Krippendorff (1980), Lombard et al (2002), Artstein and Poesio (2008), LeBreton and Senter (2008), Kottner et al (2011) and Artstein (2017), where guidelines and good practice descriptions for applying IAA have been developed. Based on our research, the following shortcomings have been identified.…”
Section: Years of IAA in Evaluation of NLG Systems (mentioning; confidence: 99%)
“…The reduction of a statistical test interpretation to a simple number, whilst common, can be arbitrary and accordingly gives us little information. For example, Artstein (2017) shows that a single label is not sufficient to give a deep understanding of the reliability of an annotation. In this paper, we do not face the problem of how to interpret IAA; rather, we try to tackle the problem of data reliability by suggesting that correlation coefficients and agreement coefficients should be used together to obtain a better assessment of the evaluation data reliability.…”
Section: Years of IAA in Evaluation of NLG Systems (mentioning; confidence: 99%)
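The point made in the quote above is that correlation and chance-corrected agreement capture different aspects of reliability. A minimal sketch, with made-up ratings (the data, scale, and annotator names are not from the cited paper), shows how two annotators can be strongly correlated yet rarely agree exactly:

```python
# Two hypothetical annotators rating the same items on a 1-5 scale.
# Pearson r asks: do the scores move together?
# Cohen's kappa asks: do the scores coincide beyond chance?
from collections import Counter
import math

ann_a = [1, 2, 2, 3, 4, 4, 5, 5, 3, 2]  # made-up example ratings
ann_b = [2, 3, 3, 4, 5, 5, 5, 4, 3, 2]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cohen_kappa(x, y):
    n = len(x)
    observed = sum(a == b for a, b in zip(x, y)) / n
    cx, cy = Counter(x), Counter(y)
    expected = sum(cx[k] * cy[k] for k in cx) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

print(f"Pearson r = {pearson(ann_a, ann_b):.3f}")      # high: scores track each other
print(f"Cohen's kappa = {cohen_kappa(ann_a, ann_b):.3f}")  # low: few exact matches
```

Here annotator B systematically rates one point higher, so r is high (about 0.86) while kappa is low (about 0.10), illustrating why reporting both gives a fuller picture of data reliability.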
“…Our first attempt is to maximize the mutual information between the predictions of p and q. Intuitively, this encourages p and q to agree on some annotation scheme (up to a permutation of labels), modeling the dynamics of inter-annotator agreement (Artstein, 2017). It can be seen as a differentiable generalization of the Brown clustering objective.…”
Section: Generalized Brown Objective (mentioning; confidence: 99%)
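The quantity this quote describes, mutual information between two predictors' label assignments, can be sketched empirically. The example below uses made-up label sequences (not the cited paper's model): when one labeling is the other up to a permutation of label names, the mutual information reaches its maximum, the entropy of either labeling, which is the inter-annotator-agreement analogy drawn in the quote.

```python
# Empirical mutual information I(P; Q) between two label sequences.
# I(P; Q) is invariant to renaming labels, so two labelers who agree
# "up to a permutation of labels" score maximally.
import math
from collections import Counter

labels_p = ["A", "A", "B", "B", "C", "C", "A", "B"]  # made-up predictions of p
labels_q = ["x", "x", "y", "y", "z", "z", "x", "y"]  # same partition, renamed (q)

def mutual_information(p, q):
    n = len(p)
    joint = Counter(zip(p, q))
    marg_p, marg_q = Counter(p), Counter(q)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((marg_p[a] / n) * (marg_q[b] / n)))
    return mi  # in nats

print(mutual_information(labels_p, labels_q))
```

Because q is a bijective relabeling of p here, the result equals the entropy of p (about 1.082 nats); any disagreement between the two labelings would lower it.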
“…The annotation was performed at sentence level and the prescribed labels were accepted using inter-annotator agreement. Inter-annotator agreement is a measure of how well two (or more) annotators make the same annotation decision for a certain label in the entire corpus [39]. We measured the inter-annotator agreement of the three annotators using Cohen's kappa coefficient [40] and found it substantial (kappa = 0.701) for further analysis and improvement in our case study.…”
Section: B. Corpus Annotation (mentioning; confidence: 93%)
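A common way to apply Cohen's kappa to three annotators, as in the quoted study, is to average the pairwise kappas. The sketch below uses made-up labels and annotator names; it does not reproduce the paper's reported kappa of 0.701, only the procedure, with the conventional Landis and Koch (1977) reading that 0.61-0.80 counts as "substantial" agreement.

```python
# Average pairwise Cohen's kappa for three hypothetical annotators.
from collections import Counter
from itertools import combinations

def cohen_kappa(x, y):
    n = len(x)
    observed = sum(a == b for a, b in zip(x, y)) / n
    cx, cy = Counter(x), Counter(y)
    expected = sum(cx[k] * cy[k] for k in cx) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

annotators = {
    "A1": ["pos", "neg", "neg", "pos", "pos", "neg", "pos", "neg"],
    "A2": ["pos", "neg", "pos", "pos", "pos", "neg", "pos", "neg"],
    "A3": ["pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg"],
}

pairwise = {f"{a}-{b}": cohen_kappa(annotators[a], annotators[b])
            for a, b in combinations(annotators, 2)}
mean_kappa = sum(pairwise.values()) / len(pairwise)
# Landis & Koch (1977): 0.61-0.80 is conventionally read as "substantial".
print(pairwise, round(mean_kappa, 3))
```

On these toy labels the mean pairwise kappa lands in the "substantial" band, the same interpretive category the quoted study assigns to its kappa of 0.701.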