Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
DOI: 10.18653/v1/2021.findings-acl.120

Detecting Hallucinated Content in Conditional Neural Sequence Generation

Abstract: Neural sequence models can generate highly fluent sentences, but recent studies have shown that they are also prone to hallucinating additional content not supported by the input. This variety of fluent but wrong output is particularly problematic, as users cannot tell that they are being presented with incorrect content. To detect these errors, we propose a task to predict whether each token in the output sequence is hallucinated (not contained in the input) and collect new manually annota…
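The task the abstract describes is per-token binary labeling of the model output. As a rough illustration of that interface only (the paper's actual detector is a learned model, not lexical matching), a naive overlap baseline might flag every output token that never occurs in the source:

    def label_hallucinations(source: str, output: str) -> list[tuple[str, int]]:
        """Naive illustration of the token-level task: 1 = hallucinated, 0 = supported.

        Hypothetical baseline only; the paper trains a detector rather than
        matching tokens lexically.
        """
        source_tokens = {tok.lower() for tok in source.split()}
        return [(tok, 0 if tok.lower() in source_tokens else 1)
                for tok in output.split()]

    src = "The festival was held in Paris in 2019 ."
    out = "The festival took place in London in 2019 ."
    print(label_hallucinations(src, out))
    # "London" is correctly flagged, but so are "took" and "place":
    # paraphrases are exactly what makes lexical overlap too crude a detector.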

Cited by 67 publications (62 citation statements); References 38 publications
“…Hallucination in text-generation models is a topic that has received attention recently, particularly in the settings of summarization (Maynez et al., 2020), machine translation (Zhou et al., 2021), and news generation (Zellers et al., 2019). For dialogue, it has been observed in state-of-the-art models (Roller et al., 2021) and studied in depth (Mielke et al., 2020), but so far without resolution.…”
Section: Related Work
confidence: 99%
“…There has been a lot of recent work in abstractive summarization showing that state-of-the-art systems suffer from generating information inconsistent with the source article, despite their improved success in producing fluent summaries (Falke et al., 2019; Lux et al., 2020). Since word-overlap based metrics such as ROUGE have low correlation with human scores of faithfulness (Kryscinski et al., 2019; Fabbri et al., 2020), there has been significant effort to develop automated metrics that can detect such errors (Zhou et al., 2021; Gabriel et al., 2021; Pagnoni et al., 2021a). For example, Falke et al. (2019), Maynez et al. (2020) and Goyal and Durrett (2020) have proposed to assess faithfulness using entailment models, where a faithful summary should be assigned a high entailment score with respect to the original article.…”
Section: Related Work
confidence: 99%
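The entailment idea this snippet describes can be sketched in a few lines. This is a minimal sketch, assuming the Hugging Face transformers library and the public roberta-large-mnli checkpoint (illustrative choices, not specified by the cited works): score a summary by the probability that the article, as premise, entails it.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Illustrative checkpoint choice; any NLI classifier would do here.
    MODEL = "roberta-large-mnli"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

    def entailment_score(article: str, summary: str) -> float:
        """P(article entails summary) under the NLI model."""
        inputs = tokenizer(article, summary, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
        # Read the entailment index from the config rather than hardcoding it.
        ent_idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]
        return probs[ent_idx].item()

    article = "The company reported a 10% rise in quarterly revenue."
    print(entailment_score(article, "Revenue grew last quarter."))   # high score expected
    print(entailment_score(article, "The company went bankrupt."))   # low score expected

In the cited work this kind of check is typically applied per summary sentence and then aggregated, since a long premise dilutes the entailment signal.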
“…We note the relatively small training time of the mBART adaptation and the lack of Icelandic data in mBART's pretraining task as the primary factors that could be addressed to improve results. Additionally, online (or semi-online) self-training instead of train-then-translate would also improve results, especially with selective loss truncation as described in Zhou et al. (2021). The data selected for backtranslation should also be expanded for greater diversity of both genre and vocabulary.…”
Section: Future Work
confidence: 99%
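The "selective loss truncation" mentioned here refers to using token-level hallucination predictions to discard unreliable tokens when training on noisy (back)translated data. A minimal PyTorch sketch under that reading, with the hallucination mask assumed to come from an external detector:

    import torch
    import torch.nn.functional as F

    def truncated_nll(logits: torch.Tensor,
                      targets: torch.Tensor,
                      hallucinated: torch.Tensor) -> torch.Tensor:
        """Token-level NLL that zeroes out the loss on tokens flagged as hallucinated.

        logits:       (batch, seq_len, vocab) decoder outputs
        targets:      (batch, seq_len) reference token ids
        hallucinated: (batch, seq_len) 1 where an external detector flags the
                      reference token as unsupported, 0 otherwise (assumed given)
        """
        nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
        keep = 1.0 - hallucinated.float()
        # Average only over kept tokens, guarding against an all-masked batch.
        return (nll * keep).sum() / keep.sum().clamp(min=1.0)

This version uses a hard 0/1 mask; a softer variant could instead scale each token's loss by the detector's confidence.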