2021
DOI: 10.48550/arxiv.2104.08704
Preprint

A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation

Abstract: Large pretrained generative models like GPT-3 often suffer from hallucinating non-existent or incorrect content, which undermines their potential merits in real applications. Existing work usually attempts to detect these hallucinations based on a corresponding oracle reference at a sentence or document level. However ground-truth references may not be readily available for many free-form text generation applications, and sentence- or document-level detection may fail to provide the fine-grained signals that wo…

Cited by 7 publications (8 citation statements)
References 34 publications

“…Dhuliawala et al., 2023). Recent surveys (Ji et al., 2023; Zhang et al., 2023c) and evaluation benchmarks (Liu et al., 2021; Li et al., 2023a; Yang et al., 2023) have highlighted the importance of addressing this issue. Previous research has explored hallucination evaluation using confidence-based approaches (Xiao and Wang, 2021; Varshney et al., 2023; Chen and Mueller, 2023) that require access to token-level log probabilities (Kuhn et al., 2023; Cole et al., 2023), or supervised tuning (Agrawal et al., 2023; Li et al., 2023b) that relies on the internal states of the LM.…”
Section: Related Work
Mentioning, confidence: 99%
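
The confidence-based detection referred to in the statement above operates on per-token log probabilities from a language model. As a minimal sketch, assuming a Hugging Face causal LM (the model name, threshold, and helper names below are illustrative assumptions, not the cited methods), one can score each token of a generated text by its log probability and flag low-confidence tokens as a rough hallucination signal:

```python
# Minimal sketch of a confidence-based, token-level signal:
# score each token by its log probability under a causal LM and
# flag tokens below an (assumed, illustrative) threshold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM with a compatible tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def token_logprobs(text: str):
    """Return (token, log_prob) pairs, each token conditioned on its prefix."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits  # shape: [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Logits at position i predict token i+1, so shift targets by one.
    target_ids = input_ids[0, 1:]
    pred_log_probs = (
        log_probs[0, :-1].gather(1, target_ids.unsqueeze(-1)).squeeze(-1)
    )
    tokens = tokenizer.convert_ids_to_tokens(target_ids.tolist())
    return list(zip(tokens, pred_log_probs.tolist()))

def flag_low_confidence(text: str, threshold: float = -5.0):
    """Flag tokens whose log probability falls below the illustrative threshold."""
    return [(tok, lp) for tok, lp in token_logprobs(text) if lp < threshold]

if __name__ == "__main__":
    print(flag_low_confidence("The Eiffel Tower was completed in 1889 in Berlin."))
```

The cited approaches build more elaborate scores on top of such token-level probabilities; this sketch only illustrates the raw signal they require access to.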
“…Next, we examine quality with human evaluation, as shown in Figure 3. Models generating nonfactual or hallucinated content is an ongoing area of study (Tian et al., 2019; Nie et al., 2019; Liu et al., 2021). Our goal is to understand how much information in the generated text is present in the reference text or the web evidence, as a proxy for factuality and coverage.…”
Section: Quality of Generated Biographies
Mentioning, confidence: 99%
“…The likelihoods of these statements are often dominated by short, plausible patterns, which also makes it difficult for LLMs to evaluate their own uncertainty about a fact. Thus, detection (Liu et al., 2021; Zhou et al., 2021) and reduction of such hallucinations are crucial for the widespread use of LLMs in real applications (Dziri et al., 2021; Shuster et al., 2021).…”
Section: Uncertainty and Hallucination Detection
Mentioning, confidence: 99%