2021
DOI: 10.48550/arxiv.2104.08704
Preprint

A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation

Abstract: Large pretrained generative models like GPT-3 often suffer from hallucinating non-existent or incorrect content, which undermines their potential merits in real applications. Existing work usually attempts to detect these hallucinations based on a corresponding oracle reference at a sentence or document level. However ground-truth references may not be readily available for many free-form text generation applications, and sentence- or document-level detection may fail to provide the fine-grained signals that wo…

Cited by 7 publications (8 citation statements)
References 34 publications

“…Dhuliawala et al., 2023). Recent surveys (Ji et al., 2023; Zhang et al., 2023c) and evaluation benchmarks (Liu et al., 2021; Li et al., 2023a; Yang et al., 2023) have highlighted the importance of addressing this issue. Previous research has explored hallucination evaluation using confidence-based approaches (Xiao and Wang, 2021; Varshney et al., 2023; Chen and Mueller, 2023) that require access to token-level log probabilities (Kuhn et al., 2023; Cole et al., 2023), or supervised tuning (Agrawal et al., 2023; Li et al., 2023b) that relies on the internal states of the LM.…”
Section: Related Work
Mentioning, confidence: 99%
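
The confidence-based detection referred to in the statement above operates on per-token log probabilities from a language model. As a minimal sketch, assuming a Hugging Face causal LM (the model name, threshold, and helper names below are illustrative assumptions, not the cited methods), one can score each token of a generated text by its log probability and flag low-confidence tokens as a rough hallucination signal:

```python
# Minimal sketch of a confidence-based, token-level signal:
# score each token by its log probability under a causal LM and
# flag tokens below an (assumed, illustrative) threshold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM with a compatible tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def token_logprobs(text: str):
    """Return (token, log_prob) pairs, each token conditioned on its prefix."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits  # shape: [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Logits at position i predict token i+1, so shift targets by one.
    target_ids = input_ids[0, 1:]
    pred_log_probs = (
        log_probs[0, :-1].gather(1, target_ids.unsqueeze(-1)).squeeze(-1)
    )
    tokens = tokenizer.convert_ids_to_tokens(target_ids.tolist())
    return list(zip(tokens, pred_log_probs.tolist()))

def flag_low_confidence(text: str, threshold: float = -5.0):
    """Flag tokens whose log probability falls below the illustrative threshold."""
    return [(tok, lp) for tok, lp in token_logprobs(text) if lp < threshold]

if __name__ == "__main__":
    print(flag_low_confidence("The Eiffel Tower was completed in 1889 in Berlin."))
```

The cited approaches build more elaborate scores on top of such token-level probabilities; this sketch only illustrates the raw signal they require access to.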
“…Next, we examine quality with human evaluation, as shown in Figure 3. Models generating nonfactual or hallucinated content is an ongoing area of study (Tian et al., 2019; Nie et al., 2019; Liu et al., 2021). Our goal is to understand how much information in the generated text is present in the reference text or the web evidence, as a proxy for factuality and coverage.…”
Section: Quality of Generated Biographies
Mentioning, confidence: 99%
“…The likelihoods of these statements are often dominated by short, plausible patterns, which also makes it difficult for LLMs to evaluate their own uncertainty about a fact. Thus, detection (Liu et al., 2021; Zhou et al., 2021) and reduction of such hallucinations are crucial for the widespread use of LLMs in real applications (Dziri et al., 2021; Shuster et al., 2021).…”
Section: Uncertainty and Hallucination Detection
Mentioning, confidence: 99%