Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.417

Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries

Abstract: Current pre-trained models applied for summarization are prone to factual inconsistencies that misrepresent the source text. Evaluating the factual consistency of summaries is thus necessary to develop better models. However, the human evaluation setup for evaluating factual consistency has not been standardized. To determine the factors that affect the reliability of the human evaluation, we crowdsource evaluations for factual consistency across state-of-the-art models on two news summarization datasets using …

Cited by 4 publications (3 citation statements) · References 26 publications

“…Kiritchenko and Mohammad (2017) demonstrated that best-worst scaling (asking evaluators to choose the best and the worst items in a set) is an efficient and reliable method for collecting annotations, and this approach has been used to collect comparative evaluations of generated text (e.g., Liu & Lapata, 2019; Amplayo et al., 2021). Best-worst scaling has also more recently been shown to be a more effective approach than Likert scales for assessing the factual consistency of summaries (Tang et al., 2022). Belz and Kow (2011) further compared continuous and discrete rating scales and found that both lead to similar results, but raters preferred continuous scales, consistent with prior findings (Svensson, 2000).…”
Section: How Is It Measured? (mentioning, confidence: 99%)
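
To make the best-worst scaling procedure referenced above concrete, here is a minimal Python sketch of the standard counting rule, in which each item's score is the fraction of times it was chosen as best minus the fraction of times it was chosen as worst. The record format, item IDs, and function name are illustrative assumptions, not the exact protocol used in the cited papers.

```python
from collections import defaultdict

def best_worst_scores(annotations):
    """Compute best-worst scaling scores.

    `annotations` is a list of (tuple_items, best_item, worst_item) records,
    one per annotator judgment on a small tuple of items (a hypothetical
    format, not the exact setup of the cited papers).

    Each item's score is (#times best - #times worst) / #times shown,
    yielding a real-valued ranking in [-1, 1].
    """
    best, worst, shown = defaultdict(int), defaultdict(int), defaultdict(int)
    for items, chosen_best, chosen_worst in annotations:
        for item in items:
            shown[item] += 1
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

# Example: three annotator judgments over 4-item tuples of summary IDs.
annotations = [
    (("s1", "s2", "s3", "s4"), "s1", "s4"),
    (("s1", "s2", "s3", "s4"), "s1", "s3"),
    (("s2", "s3", "s4", "s5"), "s2", "s4"),
]
print(sorted(best_worst_scores(annotations).items(), key=lambda kv: -kv[1]))
```
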
“…Generative language models have been widely adopted for response generation [14, 28]; however, in the realm of open-ended information-seeking dialogues, the assumption that a user's query can be definitively answered by simply summarizing information from the top retrieved passages falls short of reality. System responses are susceptible to various limitations, such as failing to find a response, which may result in hallucinations [10], providing a biased response that only partially answers the question [9], or even presenting content with factual errors [26]. Consequently, relying solely on summarizing relevant information may lead to providing users with biased, incomplete, or, worse, incorrect responses [26].…”
Section: Introduction (mentioning, confidence: 99%)
“…System responses are susceptible to various limitations, such as failing to find a response, which may result in hallucinations [10], providing a biased response that only partially answers the question [9], or even presenting content with factual errors [26]. Consequently, relying solely on summarizing relevant information may lead to providing users with biased, incomplete, or, worse, incorrect responses [26].…”
Section: Introduction (mentioning, confidence: 99%)