Abstract: Despite significant progress in neural abstractive summarization, recent studies have shown that current models are prone to generating summaries that are unfaithful to the original context. To address the issue, we study contrast candidate generation and selection as a model-agnostic post-processing technique to correct extrinsic hallucinations (i.e., information not present in the source text) in unfaithful summaries. We learn a discriminative correction model by generating alternative candidate summaries…
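The candidate-generation step this abstract describes is easy to sketch: swap each named entity in the summary for same-type entities found in the source, producing contrast candidates for a learned model to rank. Below is a minimal, illustrative version, assuming spaCy for entity recognition; it is not the authors' implementation, and the function name is ours.

```python
# Illustrative sketch of contrast candidate generation (not the paper's code).
# Each named entity in the summary is replaced with same-type entities from
# the source document, yielding alternative candidates for a selector to rank.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def generate_candidates(source: str, summary: str) -> list[str]:
    # Group source entities by semantic type (PERSON, DATE, CARDINAL, ...).
    source_ents: dict[str, set[str]] = {}
    for ent in nlp(source).ents:
        source_ents.setdefault(ent.label_, set()).add(ent.text)

    candidates = []
    for ent in nlp(summary).ents:
        # Swap in every same-type source entity except the original mention.
        for alt in sorted(source_ents.get(ent.label_, set()) - {ent.text}):
            candidates.append(summary[:ent.start_char] + alt + summary[ent.end_char:])
    return candidates
```

In the paper, a discriminative correction model then scores the original summary against these candidates and keeps the best one; any faithfulness scorer could stand in for it in this sketch.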
“…Multiple terminologies, such as faithfulness [20,22,50,117,133,144,163,172,195,219], factual consistency [18,19,24,154,157,194], fidelity [23], factualness [146], factuality [33],…”
Section: Human Evaluation (citation type: mentioning; confidence: 99%)
“…or, on the other hand, hallucination [40,73,107,154,158] and fact contradicting [129] are used in the human evaluation of hallucination to rate whether the generated text is in accord with the source input. Chen et al. [22] and Nie et al. [130] use finer-grained metrics for intrinsic hallucination and extrinsic hallucination separately. Moreover, there are some broad metrics, such as Correctness [7,12,98,182], Accuracy [97,203], and Informativeness [102], that consider both missing and additional content (extrinsic hallucinations) relative to the input source.…”
Section: Human Evaluation (citation type: mentioning; confidence: 99%)
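A common automatic proxy for the intrinsic/extrinsic distinction drawn in the snippet above checks whether the summary's entities ever appear in the source: those that do not are likely extrinsic hallucinations. A toy version of that check, again assuming spaCy (illustrative names, not a metric from the surveyed papers):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extrinsic_entity_flags(source: str, summary: str) -> list[str]:
    """Return summary entities that never occur in the source text, a crude
    proxy for extrinsic hallucination (content not grounded in the input)."""
    source_lower = source.lower()
    return [ent.text for ent in nlp(summary).ents
            if ent.text.lower() not in source_lower]
```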
“…Consequently, a better semantic understanding helps alleviate divergence from the source. For example, models have been augmented with entity information [107], relation triples extracted from the source document [20,73] via fact description extraction, synthetic data generated through replacement or perturbation [22,91], retrieved external knowledge [12,45,65,158,222], and retrieved similar training samples [13].…”
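As one concrete example of these augmentations, relation triples can be approximated with a dependency parse. A very rough sketch follows; real systems use dedicated fact-description or OpenIE extractors, and the function name here is ours:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def naive_triples(text: str) -> list[tuple[str, str, str]]:
    """Rough (subject, relation, object) extraction via dependency parsing,
    standing in for the fact-description-extraction step mentioned above."""
    triples = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ != "VERB":
                continue
            subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in tok.children if c.dep_ in ("dobj", "obj", "attr")]
            if subjects and objects:
                triples.append((subjects[0].text, tok.lemma_, objects[0].text))
    return triples
```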
Natural Language Generation (NLG) has improved substantially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent natural language generation and, in turn, to progress in downstream tasks such as abstractive summarization, dialogue generation, and data-to-text generation. However, it is also apparent that deep learning-based generation is prone to hallucinating unintended text, which degrades system performance and fails to meet user expectations in many real-world scenarios. To address this issue, there have been studies on measuring and mitigating hallucinated text; however, there has not been a comprehensive review of the state of the art in hallucination detection and mitigation. In this survey, we provide a broad overview of the research progress and challenges concerning the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucination in a large set of downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, and machine translation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated text in NLG.
“…Improving the faithfulness of summarization systems is essential for deploying these systems in real-world scenarios; as such, recent work has studied methods to improve the faithfulness of abstractive summarization systems (Zhao et al., 2020; Dong et al., 2020; Goyal and Durrett, 2021; Xu et al., 2020; Chen et al., 2021; Zhu et al., 2021). For example, Goyal and Durrett (2021) train summarization systems by modifying the training objective to maximize the likelihood of the subset of summary tokens that are considered faithful according to their factuality detection model.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
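The modified objective attributed to Goyal and Durrett (2021) in the snippet above amounts to a token-level cross-entropy restricted to tokens a factuality detector deems faithful. A schematic PyTorch version under that reading, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def faithful_token_mle(logits: torch.Tensor,
                       targets: torch.Tensor,
                       faithful_mask: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood over only the tokens marked faithful.

    logits:        (batch, seq_len, vocab) decoder outputs
    targets:       (batch, seq_len) gold summary token ids
    faithful_mask: (batch, seq_len) 1.0 where a detector judges the token faithful
    """
    # Per-token NLL; cross_entropy expects the class dim second.
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    # Average only over the faithful tokens (guard against an all-zero mask).
    return (nll * faithful_mask).sum() / faithful_mask.sum().clamp(min=1.0)
```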
“…as well as methods to improve the faithfulness of generated summaries (Kang and Hashimoto, 2020; Chen et al., 2021). Intuitively, one straightforward way of improving the faithfulness of generated summaries is to copy a larger amount of content from the source article (i.e.…”
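The degree of copying alluded to here is usually quantified with extractive fragment statistics in the style of Grusky et al. (2018). A simplified coverage computation, using greedy longest-match and intended only as illustration:

```python
def extractive_coverage(source_tokens: list[str], summary_tokens: list[str]) -> float:
    """Fraction of summary tokens lying in fragments copied verbatim from the
    source (greedy longest-match; a simplified take on Grusky et al., 2018)."""
    covered, i = 0, 0
    while i < len(summary_tokens):
        # Longest source match starting at summary position i.
        longest = 0
        for j in range(len(source_tokens)):
            k = 0
            while (i + k < len(summary_tokens) and j + k < len(source_tokens)
                   and summary_tokens[i + k] == source_tokens[j + k]):
                k += 1
            longest = max(longest, k)
        if longest:
            covered += longest
            i += longest
        else:
            i += 1
    return covered / max(len(summary_tokens), 1)
```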
Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness in the model outputs, since one naive way to improve faithfulness is to make summarization models more extractive. In this work, we present a framework for evaluating the effective faithfulness of summarization systems by generating a faithfulness-abstractiveness trade-off curve that serves as a control at different operating points on the abstractiveness spectrum. We then show that the Maximum Likelihood Estimation (MLE) baseline, as well as a recently proposed method for improving faithfulness, are both worse than the control at the same level of abstractiveness. Finally, we learn a selector to identify the most faithful and abstractive summary for a given document, and show that this system can attain higher faithfulness scores in human evaluations while being more abstractive than the baseline system on two datasets. Moreover, we show that our system achieves a better faithfulness-abstractiveness trade-off than the control at the same level of abstractiveness.
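The selector described in this abstract can be caricatured as constrained ranking: among candidate summaries, prefer high faithfulness, but only at or below a target extractiveness, so that faithfulness gains cannot come from copying alone. A minimal sketch, with every name and the threshold illustrative rather than the paper's actual learned system:

```python
def select_summary(candidates: list[str],
                   faithfulness: list[float],
                   extractiveness: list[float],
                   max_coverage: float = 0.6) -> str:
    """Return the most faithful candidate whose extractive coverage stays
    below the threshold; fall back to the full pool if none qualifies."""
    pool = [i for i, c in enumerate(extractiveness) if c <= max_coverage]
    pool = pool or list(range(len(candidates)))
    return candidates[max(pool, key=lambda i: faithfulness[i])]
```

Scores such as `faithfulness` would come from human judgments or a learned scorer, and `extractiveness` from a coverage measure like the one sketched above.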