2020
DOI: 10.48550/arxiv.2007.06898
Preprint

Our Evaluation Metric Needs an Update to Encourage Generalization

Cited by 3 publications (3 citation statements)
References 0 publications
“…Gokhale et al (2022) compare multiple ways to improve the OOD performance of an extractive model on the QA task, but how these methods affect generative models has not been well studied yet. Meanwhile, most work, including this work, evaluates OOD performance by averaging performance across multiple datasets; as mentioned in Mishra et al (2020), the evaluation should be more carefully designed. Diagnosing performance on each OOD dataset can also provide more insight.…”
Section: Discussion
confidence: 99%
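The statement above contrasts macro-averaging OOD performance across datasets with diagnosing each dataset separately. A minimal Python sketch of why averaging can hide a weak spot — the dataset names and scores here are invented for illustration, not taken from any cited paper:

```python
# Hypothetical per-dataset OOD scores (invented numbers for illustration).
ood_scores = {
    "ood_dataset_a": 0.78,
    "ood_dataset_b": 0.74,
    "ood_dataset_c": 0.31,  # a weak spot the average obscures
}

# Macro-average across datasets, as most work reports.
macro_avg = sum(ood_scores.values()) / len(ood_scores)
print(f"macro-average OOD score: {macro_avg:.2f}")

# Per-dataset diagnosis surfaces the outlier the single number hides.
for name, score in sorted(ood_scores.items(), key=lambda kv: kv[1]):
    flag = "  <-- investigate" if score < macro_avg - 0.2 else ""
    print(f"{name}: {score:.2f}{flag}")
```

The macro-average (0.61) looks respectable even though one dataset sits at 0.31, which is the kind of failure a per-dataset diagnosis would catch.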
“…We will prune all 3 datasets with the terms of selected components (based on initial SNLI pruning), to varying sizes, similar to Table 1. In recent work, word overlap [11,4] and semantic textual similarity [30] have been dominant in producing spurious bias; we therefore expect to shortlist C 3 and C 5 in our component-wise experiments. Previous work has found that the amount of artifacts in datasets is in the order: SNLI>SQUAD>MNLI [11,4,44,41,35,31].…”
Section: Proposed Experiments
confidence: 97%
“…Using OOD detection systems for selective prediction (abstaining on all detected OOD instances) would be too conservative, as it has been shown that models are able to correctly answer a significant fraction of OOD instances (Talmor and Berant, 2019; Hendrycks et al, 2020; Mishra et al, 2020).…”
Section: Appendix A: Related Tasks
confidence: 99%
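The last statement argues that abstaining on every detected-OOD instance gives up answers the model would have gotten right. A toy Python sketch, with invented instance labels, comparing an abstain-on-OOD policy against answering everything:

```python
# Invented toy data: (is_ood, model_correct) per instance.
instances = [
    (False, True), (False, True), (False, False),
    (True, True), (True, True), (True, False), (True, False),
]

# Policy A: abstain on all detected-OOD instances.
answered_a = [correct for is_ood, correct in instances if not is_ood]
coverage_a = len(answered_a) / len(instances)
accuracy_a = sum(answered_a) / len(answered_a)

# Policy B: answer everything.
answered_b = [correct for _, correct in instances]
accuracy_b = sum(answered_b) / len(answered_b)

print(f"abstain-on-OOD: coverage={coverage_a:.2f}, accuracy={accuracy_a:.2f}")
print(f"answer-all:     coverage=1.00, accuracy={accuracy_b:.2f}")
```

In this toy example, abstaining on OOD forfeits the two OOD instances the model answered correctly, cutting coverage to 3/7 for a modest accuracy gain, which is the conservatism the quoted statement describes.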