2022
DOI: 10.1162/tacl_a_00449
Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations

Abstract: Majority voting and averaging are common approaches used to resolve annotator disagreements and derive single ground truth labels from multiple annotations. However, annotators may systematically disagree with one another, often reflecting their individual biases and values, especially in the case of subjective tasks such as detecting affect, aggression, and hate speech. Annotator disagreements may capture important nuances in such tasks that are often ignored while aggregating annotations to a single ground t…
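The two aggregation strategies the abstract names, majority voting for categorical labels and averaging for numeric ones, can be sketched in a few lines. This is a minimal illustration with made-up annotator inputs, not code from the paper:

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Return the most common label among annotators (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def average_rating(ratings):
    """Collapse numeric annotations (e.g. an affect intensity scale) into their mean."""
    return mean(ratings)

# Three hypothetical annotators disagree on whether a comment is hate speech:
print(majority_vote(["hate", "not_hate", "hate"]))  # -> hate
# The dissenting vote -- possibly a systematic difference in judgment -- is discarded.
print(average_rating([1, 1, 5]))  # -> 2.33... (the outlying rating is averaged away)
```

Both functions return a single "ground truth" value, which is exactly the step the paper argues can erase meaningful disagreement on subjective tasks.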

Cited by 89 publications (74 citation statements)
References 68 publications
“…merge two labels into one or separate one label into two or more) is another direction for future work. There has been recent work in dealing with bias in annotation [2]. Having an automated assistant to ensure consistent annotations could be a way to avoid bias.…”
Section: Results
confidence: 99%
“…Annotations can range from a fixed set of categorical labels that are associated 1-to-1 with data items, to sequential labels that may have order constraints, to complex, multifaceted structure [1], [2], [11], [12]. More recently, captioning tasks involve associating unstructured descriptions as annotations of data.…”
Section: Complex Annotation Tasks and Automation
confidence: 99%
“…Suresh and Guttag [190] define this bias as a positive value for a measure of divergence between the probability distribution over the input space and the true distribution, noting that it can occur simply as a result of random sampling from a distribution where some groups are in the minority. Others point to the potential for overlooked errors in the labeling process, which is often left undescribed in research papers [73], to lead to overfitting even in the absence of other types of noise [35,152], and the way that data preparation can be lossy whenever majority-rule is used to construct ground truth without preserving information about label distributions [54,93].…”
Section: Data Collection and Preparation
confidence: 99%
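The point in the statement above, that majority-rule ground truth is lossy whenever label distributions are not preserved, can be made concrete: two items can receive the same majority label while carrying very different levels of annotator agreement. A small illustrative sketch with invented votes:

```python
from collections import Counter

def majority_label(votes):
    """Hard label: the single most common annotation."""
    return Counter(votes).most_common(1)[0][0]

def label_distribution(votes):
    """Soft label: relative frequency of each label across annotators."""
    counts = Counter(votes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

contested = ["toxic", "toxic", "toxic", "ok", "ok"]  # 3-2 split
unanimous = ["toxic"] * 5                            # 5-0 agreement

# Majority rule maps both items to the same ground truth...
assert majority_label(contested) == majority_label(unanimous) == "toxic"
# ...but the full distributions expose the disagreement that aggregation hides.
print(label_distribution(contested))  # {'toxic': 0.6, 'ok': 0.4}
print(label_distribution(unanimous))  # {'toxic': 1.0}
```

Keeping the distribution (or per-annotator labels) alongside the hard label is what the cited work means by not discarding information during data preparation.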
“…
• High and unmodeled measurement error [14,59,134]
• Data transformations decided contingent on (NHST) results [83,181]
• Non-representative [105,143] or underdefined subject samples [88]; insufficient stimuli sampling [87,207,212]
• Small samples and noisy measurements (low power) leading to biased estimates [40]
• Differential measurement error [39,156,190,216]; unmodeled measurement error [119,127]
• Label errors [35,152] and disagreement [54,93]
• Data transformations decided contingent on performance comparisons [130]
• Underrepresentation of portions of input space in training data [13,157,190]
• Input data too huge to understand [19,157]
Model representation…”
Section: Data Selection and Preparation
confidence: 99%