Proceedings of the Workshop on Machine Reading for Question Answering 2018
DOI: 10.18653/v1/w18-2602

Systematic Error Analysis of the Stanford Question Answering Dataset

Abstract: We analyzed the outputs of multiple question answering (QA) models applied to the Stanford Question Answering Dataset (SQuAD) to identify the core challenges for QA systems on this data set. Through an iterative process, challenging aspects were hypothesized through qualitative analysis of the common error cases. A classifier was then constructed to predict whether SQuAD test examples were likely to be difficult for systems to answer based on features associated with the hypothesized aspects. The classifier's …
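The difficulty-prediction step described in the abstract can be illustrated with a minimal sketch. Assuming hand-crafted per-question features (e.g., question length, question/passage lexical overlap, answer span length) and binary labels marking whether the QA systems failed on an example, a simple logistic-regression classifier is one plausible instantiation; the features and data below are hypothetical, not the paper's actual setup.

```python
# Minimal sketch of a difficulty classifier over hypothetical question
# features; the features and data are illustrative, not the authors' setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [question length, question/passage lexical overlap, answer span length]
X_train = np.array([
    [12, 0.8, 2],
    [25, 0.3, 7],
    [9,  0.9, 1],
    [30, 0.2, 9],
])
# Label 1 = the QA systems answered this example incorrectly ("difficult"),
# label 0 = answered correctly ("easy")
y_train = np.array([0, 1, 0, 1])

clf = LogisticRegression().fit(X_train, y_train)

# Estimate how likely a new test example is to be difficult
x_new = np.array([[28, 0.25, 6]])
print("P(difficult):", clf.predict_proba(x_new)[0, 1])
```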

Cited by 26 publications (19 citation statements)
References 9 publications (9 reference statements)
“…Weissenborn, Wiese, and Seiffe (2017) found a feature indicating if a word appears in the question important, and suggested that questions can be answered with some rules that rely only on superficial features. Rondeau and Hazen (2018) validated the suggestion by a series of systematic experiments. However, the conversational question answering systems have been rarely explored.…”
Section: Related Work (supporting)
confidence: 59%
“…But suppose there's a new subject that wants to challenge Ken; they are not going to reliably dethrone Ken until their skill θ c is greater than six. This is a more mathematical formulation of the "easy" and "hard" dataset splits in question answering (Sugawara et al., 2018; Rondeau and Hazen, 2018; Sen and Saffari, 2020). In IRT-feas, this recapitulates the observation of Boyd-Graber and Börschinger (2020) that annotation error can hinder effective leaderboards.…”
Section: Examples Are Not Equally Useful (mentioning)
confidence: 74%
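The skill threshold in the quoted passage follows the standard item response theory (IRT) form. A minimal sketch of the one-parameter logistic (Rasch) response curve is below; the item difficulty of 6 is only the quote's illustrative number, not a value from the paper.

```python
# Sketch of a 1PL (Rasch) IRT response curve: the probability that a
# subject with skill theta answers an item of difficulty b correctly.
import math

def p_correct(theta: float, b: float) -> float:
    """1PL IRT: P(correct) = sigmoid(theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# With item difficulty b = 6 (the quote's example), a challenger only
# answers correctly more often than not once their skill exceeds 6.
for theta in (4.0, 6.0, 8.0):
    print(theta, round(p_correct(theta, b=6.0), 3))
```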
“…If you find them boring, repetitive, or uninteresting, so will crowdworkers. If you can find shortcuts to answer questions (Rondeau and Hazen, 2018; Kaushik and Lipton, 2018), so will a computer.…”
Section: Are We Having Fun? (mentioning)
confidence: 99%