Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.322

Evaluating Factuality in Generation with Dependency-level Entailment

Abstract: Despite significant progress in text generation models, a serious limitation is their tendency to produce text that is factually inconsistent with information in the input. Recent work has studied whether textual entailment systems can be used to identify factual errors; however, these sentence-level entailment models are trained to solve a different problem than generation filtering and they do not localize which part of a generation is non-factual. In this paper, we propose a new formulation of entailment th…
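As a rough illustration of the dependency-level formulation named in the title, the sketch below splits a generated sentence into dependency arcs with spaCy and flags arcs that an arc-level entailment judge does not support. The arc_is_entailed callable and the toy lexical-overlap stand-in are hypothetical placeholders, not the paper's trained model.

import spacy

# Any spaCy English pipeline with a dependency parser works here.
nlp = spacy.load("en_core_web_sm")

def dependency_arcs(sentence):
    """Return (child, relation, head) triples for every non-root token."""
    doc = nlp(sentence)
    return [(tok.text, tok.dep_, tok.head.text) for tok in doc if tok.dep_ != "ROOT"]

def localize_nonfactual_arcs(source, generation, arc_is_entailed):
    """Keep only the arcs of the generation that the arc-level judge rejects."""
    return [arc for arc in dependency_arcs(generation) if not arc_is_entailed(source, arc)]

# Toy stand-in judge: an arc counts as supported if its child word appears in the source.
src = "The company reported a loss of $2 million in 2019."
gen = "The company reported a profit of $2 million in 2019."
overlap = lambda source, arc: arc[0].lower() in source.lower()
print(localize_nonfactual_arcs(src, gen, overlap))  # expected to flag the arc whose child is "profit"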

Cited by 58 publications (85 citation statements).
References 26 publications (33 reference statements).
“…First, while synthetic data generation approaches are specifically designed for factuality evaluation, do these align with actual errors made by generation models? We find the answer is no: techniques using surface-level data corruption (Kryscinski et al., 2020; Zhao et al., 2020; Cao et al., 2020) or paraphrasing (Goyal and Durrett, 2020a) target inherently different error distributions than those seen in actual model generations, and factuality models trained on these datasets perform poorly in practice. Furthermore, we show that different summarization domains, CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016) and XSum (Narayan et al., 2018) (which differ in the style of summaries and degree of abstraction), exhibit substantially different error distributions in generated summaries, and the same dataset creation approach cannot be used across the board.…”
Section: Introduction (mentioning)
confidence: 95%
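As an aside on what "surface-level data corruption" typically looks like, here is a minimal entity-swap sketch in the spirit of that family of methods (not the exact procedure of any cited paper). It assumes spaCy NER and simply replaces one summary entity with a same-type entity taken from the source document to create a synthetic non-factual example.

import random
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_swap(summary, document, seed=0):
    """Replace one entity in the summary with a same-type entity from the document,
    yielding a synthetic non-factual summary; returns the summary unchanged if no swap exists."""
    rng = random.Random(seed)
    sum_doc, src_doc = nlp(summary), nlp(document)
    for ent in sum_doc.ents:
        candidates = [e.text for e in src_doc.ents
                      if e.label_ == ent.label_ and e.text != ent.text]
        if candidates:
            return summary.replace(ent.text, rng.choice(candidates), 1)
    return summary

# entity_swap("Obama visited Paris on Tuesday.", "Sources said Merkel met Obama in Berlin.")
# -> one plausible corruption: "Merkel visited Paris on Tuesday."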
“…Past work has explored using off-the-shelf frameworks such as entailment models (Falke et al., 2019) or QA systems (Durmus et al., 2020) to detect and sometimes correct errors in generated summaries. Another line of recent work has used synthetically generated data to specifically train models on the factuality detection task (Kryscinski et al., 2020; Zhao et al., 2020; Goyal and Durrett, 2020a). However, these efforts have focused on different datasets, summarization systems, and error types, often shedding little light on what errors state-of-the-art systems are actually making and how to fix them.…”
Section: Introduction (mentioning)
confidence: 99%
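For reference, the "off-the-shelf entailment model" route mentioned above can be sketched with a public NLI checkpoint: the source acts as premise, a summary sentence as hypothesis, and the entailment probability serves as a factuality score. The roberta-large-mnli checkpoint and its label names are assumptions about tooling, not the setup used by the cited papers.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # assumed public NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def entailment_score(premise, hypothesis):
    """Probability that the premise (source) entails the hypothesis (summary sentence)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    return probs[model.config.label2id["ENTAILMENT"]].item()

# Generated sentences can then be ranked or filtered by entailment_score(source_document, sentence).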
“…Despite their strong performance on automatic metrics like ROUGE (Lin, 2004), abstractive models are not as straightforward and interpretable as their extractive counterparts. Free-form generation in these models also leads to serious downstream errors, such as factual inconsistencies with the input document (Cao et al., 2018; Kryściński et al., 2020; Wang et al., 2020; Durmus et al., 2020; Goyal and Durrett, 2020). Although the interpretability of NLU models has been extensively studied (Ribeiro et al., 2016; Ghaeini et al., 2018; Jain and Wallace, 2019; Desai and Durrett, 2020), summarization models specifically have not received similar attention, with analysis efforts often focused on datasets and evaluation (Kryscinski et al., 2019).…”
Section: Introduction (mentioning)
confidence: 99%
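For completeness, the ROUGE comparison alluded to above can be reproduced with the rouge-score package (an assumed tool choice; the cited papers may use other implementations).

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The senate passed the budget bill on Friday."
generated = "The budget bill was passed by the senate."
scores = scorer.score(reference, generated)  # target (reference) first, prediction second
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")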
“…Plausible compressions are those that, when applied, result in grammatical and factual sentences; that is, sentences that are syntactically permissible, linguistically acceptable to native speakers (Chomsky, 1956; Schütze, 1996), and factually correct from the perspective of the original sentence. Satisfying these three criteria is challenging: acceptability is inherently subjective and measuring factuality in text generation is a major open problem (Kryściński et al., 2020; Durmus et al., 2020; Goyal and Durrett, 2020). Figure 1 gives examples of plausible deletions: note that 'of dozens of California wineries' would be grammatical to delete but significantly impacts factuality.…”
Section: Plausibility (mentioning)
confidence: 99%
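To make the "plausible deletion" example above concrete, here is a minimal sketch that enumerates candidate compressions by removing prepositional-phrase subtrees with spaCy. Deciding whether each candidate remains grammatical and factual, which the statement identifies as the hard part, is not modeled here.

import spacy

nlp = spacy.load("en_core_web_sm")

def pp_deletion_candidates(sentence):
    """Yield variants of the sentence with one prepositional-phrase subtree removed."""
    doc = nlp(sentence)
    for tok in doc:
        if tok.dep_ == "prep":                     # preposition heading a PP
            drop = {t.i for t in tok.subtree}      # indices of the whole PP subtree
            # Whitespace around the removed span is not cleaned up in this sketch.
            yield "".join(t.text_with_ws for t in doc if t.i not in drop).strip()

# for cand in pp_deletion_candidates("One of dozens of California wineries opened a tasting room in Napa."):
#     print(cand)   # e.g. "One opened a tasting room in Napa." -- grammatical but no longer factual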