2021
DOI: 10.48550/arxiv.2104.08731
Preprint

Can NLI Models Verify QA Systems' Predictions?

Abstract: To build robust question answering systems, we need the ability to verify whether answers to questions are truly correct, not just "good enough" in the context of imperfect QA datasets. We explore the use of natural language inference (NLI) as a way to achieve this goal, as NLI inherently requires the premise (document context) to contain all necessary information to support the hypothesis (proposed answer to the question). We leverage large pretrained models and recent prior datasets to construct powerful que…

Cited by 4 publications (3 citation statements)
References 52 publications (81 reference statements)
“…Fact Duration Following suit with the QA evaluations above, we also evaluate fact duration prediction on SituatedQA. To generate fact-duration pairs, we use the annotated previous answer as of 2021, converting the question/answer pair into a statement using an existing T5-based conversion model (Chen et al., 2021a). We then use the distance between the start dates of the 2021 answer and the previous answer as the fact's duration, d.…”
Section: Evaluation Datasets
confidence: 99%
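The duration computation described in this citation statement can be sketched in a few lines: the fact's duration d is the distance between the start dates of the current (as-of-2021) answer and the previous answer. The dates below are illustrative, not drawn from SituatedQA.

```python
from datetime import date

def fact_duration_days(current_start: date, previous_start: date) -> int:
    """Fact duration d: distance between the start date of the
    current (as-of-2021) answer and that of the previous answer."""
    return (current_start - previous_start).days

# Illustrative dates only (not from the dataset).
d = fact_duration_days(date(2021, 1, 20), date(2017, 1, 20))
```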
“…TimeQA (Chen et al., 2021b) is one such work that curates a dataset of 70 different temporally-dependent relations from Wikidata and uses handcrafted templates to convert them into decontextualized QA pairs, where the question specifies a time period. To convert this dataset into fact-duration pairs (f, d), we first convert their QA pairs into factual statements by removing the date and using a QA-to-statement conversion model (Chen et al., 2021a). We then determine the duration of each fact to be the length of time between the start date of one answer to the question and the start date of the next.…”
Section: Distant Supervision Sources
confidence: 99%
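Given a timeline of successive answer start dates for one temporally-dependent question, the duration of each fact is the gap to the next answer's start date. A minimal sketch, with a hypothetical timeline:

```python
from datetime import date

def durations_from_timeline(starts: list[date]) -> list[int]:
    """Duration of each fact = time (in days) between the start date
    of one answer and the start date of the next answer."""
    return [(nxt - cur).days for cur, nxt in zip(starts, starts[1:])]

# Hypothetical start dates of three successive answers.
timeline = [date(2001, 1, 1), date(2005, 1, 1), date(2009, 1, 1)]
durations = durations_from_timeline(timeline)
```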
“…Answer Correctness (AC) QA models often lack the ability to verify the correctness of the predicted answer (Chen et al., 2021). One way to address this issue is to reformulate it as a textual entailment problem (Harabagiu and Hickl, 2006; Richardson et al., 2013; Chen et al., 2021) by viewing the answer context as the premise and the QA pair as the hypothesis. We then use a natural language inference (NLI) system to verify whether the candidate answer proposed by crowd workers satisfies the entailment criterion.…”
Section: Answer
confidence: 99%
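The entailment reformulation can be illustrated with a minimal premise/hypothesis construction. The string template below is a naive stand-in for the trained QA-to-statement conversion model the cited work uses, and the function name and example texts are hypothetical:

```python
def build_nli_example(context: str, question: str, answer: str) -> tuple[str, str]:
    """Recast answer verification as textual entailment:
    premise = answer context, hypothesis = QA pair as a statement."""
    # Naive template; the cited work uses a trained QA-to-statement model.
    hypothesis = f"The answer to the question '{question}' is {answer}."
    return context, hypothesis

premise, hypothesis = build_nli_example(
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "When did Marie Curie win the Nobel Prize in Physics?",
    "1903",
)
# An off-the-shelf NLI model would then score whether the premise
# entails the hypothesis; entailment means the answer is verified.
```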