2021
DOI: 10.48550/arxiv.2104.07555
Preprint

Data-QuestEval: A Referenceless Metric for Data-to-Text Semantic Evaluation

Abstract: In this paper, we explore how QUESTEVAL, which is a Text-vs-Text metric, can be adapted for the evaluation of Data-to-Text Generation systems. QUESTEVAL is a referenceless metric that compares the predictions directly to the structured input data by automatically asking and answering questions. Its adaptation to Data-to-Text is not straightforward as it requires multi-modal Question Generation and Answering (QG & QA) systems. To this purpose, we propose to build synthetic multi-modal corpora that enables to tr…
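To make the QG & QA loop described in the abstract concrete, the snippet below is a minimal Python sketch of one direction of such a referenceless score: questions are generated from the prediction, answered against the structured input, and the two answers are compared. This is not the authors' Data-QuestEval implementation. The generate_questions and answer_from_table callables are hypothetical placeholders for the multi-modal QG & QA models the paper trains on synthetic corpora, and token-level F1 stands in for whatever answer comparison the real metric uses.

```python
# Minimal sketch of a QG & QA based referenceless score (not the official
# Data-QuestEval code). The QG and QA models are passed in as callables and
# are hypothetical placeholders here.
from collections import Counter
from typing import Callable, Dict, List


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def referenceless_score(
    table: Dict[str, str],                                     # structured input data
    prediction: str,                                           # generated text to evaluate
    generate_questions: Callable[[str], List[Dict[str, str]]],
    answer_from_table: Callable[[str, Dict[str, str]], str],
) -> float:
    """Ask questions about the prediction and answer them from the table.

    The score is the mean answer agreement: it is high when the prediction
    only states facts that the structured input supports.
    """
    qa_pairs = generate_questions(prediction)                  # [{"question": ..., "answer": ...}]
    if not qa_pairs:
        return 0.0
    scores = [
        token_f1(answer_from_table(pair["question"], table), pair["answer"])
        for pair in qa_pairs
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy rule-based stand-ins for the learned QG & QA models, for illustration only.
    table = {"name": "Alan Turing", "birth_year": "1912"}
    prediction = "Alan Turing was born in 1912."
    qg = lambda text: [{"question": "When was Alan Turing born?", "answer": "1912"}]
    qa = lambda question, tbl: tbl["birth_year"]
    print(referenceless_score(table, prediction, qg, qa))      # 1.0
```

The actual metric relies on trained multi-modal QG & QA models rather than toy callables, and a complete QuestEval-style score also considers questions generated from the data and answered on the prediction; the sketch above only illustrates the overall scoring loop.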

Cited by 4 publications (5 citation statements)
References 17 publications
“…Hallucination detection. Existing research primarily contains statistical metrics [28,74,80], model-based metrics (including Information Extraction (IE)-based metric, QA-based metric [32,65,68], Natural Language Inference (NLI) Metrics [33,38,81], Faithfulness Classification Metrics [32,48,89], LM-based Metrics [26,75]), and human-based evaluations [69,73]. We list some typical work as follows: Dhingra et al [22] propose PARENT to measure hallucinations using both the source and target text as references.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
“…With the success of neural techniques in text generation tasks, applying neural sequence-to-sequence generation models became more common (Du et al., 2017; Sun et al., 2018). More recent works leverage pre-trained transformer-based networks, such as T5 (Raffel et al., 2020), BART (Lewis et al., 2019), PEGASUS and ProphetNet (Yan et al., 2020b), for question generation which have been successful in many applications (Dong et al., 2019b; Lelkes et al., 2021; Rebuffel et al., 2021; Pan et al., 2021).…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
“…Question generation (QG) aims to automatically create questions from a given text passage or document with or without answers. It has a wide range of applications such as improving question answering (QA) systems (Duan et al., 2017) and search engines (Han et al., 2019) through data augmentation, making chatbots more engaging (Laban et al., 2020), enabling automatic evaluation (Rebuffel et al., 2021) and fact verification (Pan et al., 2021), and facilitating educational applications (Chen et al., 2018).…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
“…BARTScore (Yuan et al., 2021): Because ROUGE scores only measure token overlap, other automated metrics (Rebuffel et al., 2021; Kryscinski et al., 2020; Wang et al., 2020; … et al., 2005) […] and SAMSum (Gliwa et al., 2019) datasets. We adopt some results reported from the literature (Feng et al., 2021a) and implement the pre-trained models for a fair comparison.…”
Section: Evaluation Metrics
Citation type: mentioning (confidence: 99%)