2022
DOI: 10.1613/jair.1.13167

FFCI: A Framework for Interpretable Automatic Evaluation of Summarization

Abstract: In this paper, we propose FFCI, a framework for fine-grained summarization evaluation that comprises four elements: faithfulness (degree of factual consistency with the source), focus (precision of summary content relative to the reference), coverage (recall of summary content relative to the reference), and inter-sentential coherence (document fluency between adjacent sentences). We construct a novel dataset for focus, coverage, and inter-sentential coherence, and develop automatic methods for evaluating each…
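As a rough illustration of three of the four dimensions, the sketch below scores a summary against its reference (precision as a proxy for focus, recall as a proxy for coverage) and against the source document (a proxy for faithfulness) using BERTScore, one of the metric families examined in the paper. The use of BERTScore for all three, the default checkpoint, and the function name are assumptions for illustration, not the paper's prescribed implementation; coherence is not covered here.

```python
# Hedged sketch: focus/coverage/faithfulness via BERTScore.
# Assumptions: BERTScore as the scorer for all three dimensions; the FFCI
# paper evaluates several metric choices and may not use exactly this setup.
from bert_score import score

def ffci_style_scores(summary: str, reference: str, source: str) -> dict:
    # Precision/recall of the summary against the human reference.
    P_ref, R_ref, _ = score([summary], [reference], lang="en")
    # Precision of the summary against the source, as a faithfulness proxy.
    P_src, _, _ = score([summary], [source], lang="en")
    return {
        "focus": P_ref.item(),         # precision w.r.t. the reference
        "coverage": R_ref.item(),      # recall w.r.t. the reference
        "faithfulness": P_src.item(),  # consistency with the source
    }
```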

Cited by 17 publications (10 citation statements)
References 129 publications
“…To capture coverage, we record faithful-adjusted recall (FaR): the fraction of non-hallucinated reference entities included in a model output (Shing et al, 2021). As in Koto et al (2020); Adams et al (2021), we approximate coherence with the average next-sentence prediction (NSP) probability between adjacent sentences (we rely on an in-domain model, ClinicalBERT (Alsentzer et al, 2019)).…”
Section: Methods
confidence: 99%
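A minimal sketch of the two quantities described in this citation statement: faithful-adjusted recall over a set of reference entities, and coherence as the average next-sentence-prediction probability over adjacent sentence pairs. The entity extraction step, the hallucination labels, the helper names, and the generic bert-base-uncased NSP checkpoint (standing in for the in-domain ClinicalBERT used by the citing paper) are assumptions for illustration.

```python
# Hedged sketch: faithful-adjusted recall (FaR) and NSP-based coherence.
# Assumptions: entities are given as strings, hallucinated entities are already
# flagged, and a generic BERT NSP head stands in for the in-domain model.
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

def faithful_adjusted_recall(reference_entities, hallucinated_entities, output_text):
    """Fraction of non-hallucinated reference entities that appear in the output."""
    kept = [e for e in reference_entities if e not in hallucinated_entities]
    if not kept:
        return 0.0  # edge-case choice for illustration
    found = sum(1 for e in kept if e.lower() in output_text.lower())
    return found / len(kept)

def nsp_coherence(sentences, model_name="bert-base-uncased"):
    """Average next-sentence-prediction probability over adjacent sentence pairs."""
    tok = BertTokenizer.from_pretrained(model_name)
    model = BertForNextSentencePrediction.from_pretrained(model_name)
    probs = []
    for a, b in zip(sentences, sentences[1:]):
        enc = tok(a, b, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits
        # Index 0 of BERT's NSP head is the "is next sentence" class.
        probs.append(torch.softmax(logits, dim=-1)[0, 0].item())
    return sum(probs) / len(probs) if probs else 0.0
```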
“…Quantitative comparisons among entailment-based and QA-based metrics, however, often differ in their choices of baseline model and input granularity, evaluating on single datasets and drawing differing conclusions as to the best paradigm. For example, some work reports entailment-based metrics as performing best (Koto et al, 2020; Maynez et al, 2020), while other work argues for QA metrics (Durmus et al, 2020; Wang et al, 2020b; Scialom et al, 2021). Recently, Laban et al (2021) proposed a benchmark called SummaC to compare metrics across six factual consistency datasets for the task of binary factual consistency classification, whether a summary is entirely factually consistent or not.…”
Section: Entailment Matrix
confidence: 99%
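A minimal sketch of the sentence-level entailment-matrix idea that benchmarks like SummaC build on: score every (source sentence, summary sentence) pair with an NLI model, aggregate per summary sentence, and average. The roberta-large-mnli checkpoint and the max/mean aggregation are assumptions for illustration, not the exact SummaC procedure.

```python
# Hedged sketch of a sentence-level entailment matrix for factual consistency.
# Assumptions: roberta-large-mnli as the NLI model; max over source sentences
# per summary sentence, then mean over summary sentences; non-empty inputs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def entailment_matrix_score(source_sents, summary_sents,
                            model_name="roberta-large-mnli"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    # roberta-large-mnli maps ENTAILMENT to class index 2; fall back to 2.
    entail_idx = model.config.label2id.get("ENTAILMENT", 2)
    per_summary = []
    for hyp in summary_sents:
        probs = []
        for prem in source_sents:
            enc = tok(prem, hyp, return_tensors="pt", truncation=True)
            with torch.no_grad():
                logits = model(**enc).logits
            probs.append(torch.softmax(logits, dim=-1)[0, entail_idx].item())
        per_summary.append(max(probs))  # best-supporting source sentence
    return sum(per_summary) / len(per_summary)
```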
“…Evaluating Factual Consistency Within entailment-based factual consistency evaluation, Falke et al (2019) propose the task of ranking summary pairs for factual consistency based on entailment models, while Kryscinski et al (2020) explore factual consistency classification jointly with source support or contradiction span extraction. Other work on entailment-based metrics has examined input granularity (Goyal and Durrett, 2020), trained on adversarial datasets (Barrantes et al, 2020), and explored entailment-based models as the backbone of other metrics such as BERTScore (Zhang et al, 2020b), as in Koto et al (2021). Metric comparisons, however, were often conducted on isolated datasets.…”
Section: Related Work
confidence: 99%
“…Recently, metrics have been proposed for evaluating factual consistency, including applying natural language inference (Falke et al, 2019; Kryscinski et al, 2020) and question-answering models (Eyal et al, 2019; Scialom et al, 2019; Durmus et al, 2020; Wang et al, 2020). However, current metrics still do not correlate highly with human judgments on factual consistency (Koto et al, 2020; Pagnoni et al, 2021). To overcome the inherent limitation of automatic metrics, researchers typically crowdsource human evaluations using platforms such as Amazon's Mechanical Turk (MTurk) (Gillick and Liu, 2010; Sabou et al, 2012; Lloret et al, 2013).…”
Section: Introduction
confidence: 99%