2020
DOI: 10.48550/arxiv.2007.06898
Preprint

Our Evaluation Metric Needs an Update to Encourage Generalization

Cited by 3 publications (3 citation statements)
References 0 publications
“…Gokhale et al (2022) compare multiple ways to improve the OOD performance of an extractive model on the QA task, but how these methods affect generative models has not been well studied yet. Meanwhile, most work, including this work, evaluates OOD performance by averaging performance across multiple datasets; as mentioned in Mishra et al (2020), the evaluation should be more carefully designed. Diagnosing performance on each OOD dataset can also provide more insight.…”
Section: Discussion
confidence: 99%
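The statement above contrasts macro-averaging OOD performance across datasets with diagnosing each dataset separately. A minimal Python sketch of why averaging can hide a weak spot — the dataset names and scores here are invented for illustration, not taken from any cited paper:

```python
# Hypothetical per-dataset OOD scores (invented numbers for illustration).
ood_scores = {
    "ood_dataset_a": 0.78,
    "ood_dataset_b": 0.74,
    "ood_dataset_c": 0.31,  # a weak spot the average obscures
}

# Macro-average across datasets, as most work reports.
macro_avg = sum(ood_scores.values()) / len(ood_scores)
print(f"macro-average OOD score: {macro_avg:.2f}")

# Per-dataset diagnosis surfaces the outlier the single number hides.
for name, score in sorted(ood_scores.items(), key=lambda kv: kv[1]):
    flag = "  <-- investigate" if score < macro_avg - 0.2 else ""
    print(f"{name}: {score:.2f}{flag}")
```

The macro-average (0.61) looks respectable even though one dataset sits at 0.31, which is the kind of failure a per-dataset diagnosis would catch.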
“…We will prune all 3 datasets with the terms of selected components (based on initial SNLI pruning), to varying sizes, similar to Table 1. In recent work, word overlap [11,4] and semantic textual similarity [30] have been dominant in producing spurious bias; we therefore expect to shortlist C 3 and C 5 in our component-wise experiments. Previous work has found that the amount of artifacts in datasets is in the order: SNLI>SQUAD>MNLI [11,4,44,41,35,31].…”
Section: Proposed Experiments
confidence: 97%
“…Using OOD detection systems for selective prediction (abstaining on all detected OOD instances) would be too conservative, as it has been shown that models are able to correctly answer a significant fraction of OOD instances (Talmor and Berant, 2019; Hendrycks et al, 2020; Mishra et al, 2020).…”
Section: Appendix A: Related Tasks
confidence: 99%
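The last statement argues that abstaining on every detected-OOD instance gives up answers the model would have gotten right. A toy Python sketch, with invented instance labels, comparing an abstain-on-OOD policy against answering everything:

```python
# Invented toy data: (is_ood, model_correct) per instance.
instances = [
    (False, True), (False, True), (False, False),
    (True, True), (True, True), (True, False), (True, False),
]

# Policy A: abstain on all detected-OOD instances.
answered_a = [correct for is_ood, correct in instances if not is_ood]
coverage_a = len(answered_a) / len(instances)
accuracy_a = sum(answered_a) / len(answered_a)

# Policy B: answer everything.
answered_b = [correct for _, correct in instances]
accuracy_b = sum(answered_b) / len(answered_b)

print(f"abstain-on-OOD: coverage={coverage_a:.2f}, accuracy={accuracy_a:.2f}")
print(f"answer-all:     coverage=1.00, accuracy={accuracy_b:.2f}")
```

In this toy example, abstaining on OOD forfeits the two OOD instances the model answered correctly, cutting coverage to 3/7 for a modest accuracy gain, which is the conservatism the quoted statement describes.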