2022
DOI: 10.1613/jair.1.13167

FFCI: A Framework for Interpretable Automatic Evaluation of Summarization

Abstract: In this paper, we propose FFCI, a framework for fine-grained summarization evaluation that comprises four elements: faithfulness (degree of factual consistency with the source), focus (precision of summary content relative to the reference), coverage (recall of summary content relative to the reference), and inter-sentential coherence (document fluency between adjacent sentences). We construct a novel dataset for focus, coverage, and inter-sentential coherence, and develop automatic methods for evaluating each…
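As a rough illustration of three of the four dimensions, the sketch below scores a summary against its reference (precision as a proxy for focus, recall as a proxy for coverage) and against the source document (a proxy for faithfulness) using BERTScore, one of the metric families examined in the paper. The use of BERTScore for all three, the default checkpoint, and the function name are assumptions for illustration, not the paper's prescribed implementation; coherence is not covered here.

```python
# Hedged sketch: focus/coverage/faithfulness via BERTScore.
# Assumptions: BERTScore as the scorer for all three dimensions; the FFCI
# paper evaluates several metric choices and may not use exactly this setup.
from bert_score import score

def ffci_style_scores(summary: str, reference: str, source: str) -> dict:
    # Precision/recall of the summary against the human reference.
    P_ref, R_ref, _ = score([summary], [reference], lang="en")
    # Precision of the summary against the source, as a faithfulness proxy.
    P_src, _, _ = score([summary], [source], lang="en")
    return {
        "focus": P_ref.item(),         # precision w.r.t. the reference
        "coverage": R_ref.item(),      # recall w.r.t. the reference
        "faithfulness": P_src.item(),  # consistency with the source
    }
```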

Cited by 17 publications (10 citation statements)
References 129 publications
“…To capture coverage, we record faithful-adjusted recall (FaR): the fraction of non-hallucinated reference entities included in a model output (Shing et al, 2021). As in Koto et al (2020); Adams et al (2021), we approximate coherence with the average next-sentence prediction (NSP) probability between adjacent sentences (we rely on an in-domain model, ClinicalBERT (Alsentzer et al, 2019)).…”
Section: Methods
confidence: 99%
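A minimal sketch of the two quantities described in this citation statement: faithful-adjusted recall over a set of reference entities, and coherence as the average next-sentence-prediction probability over adjacent sentence pairs. The entity extraction step, the hallucination labels, the helper names, and the generic bert-base-uncased NSP checkpoint (standing in for the in-domain ClinicalBERT used by the citing paper) are assumptions for illustration.

```python
# Hedged sketch: faithful-adjusted recall (FaR) and NSP-based coherence.
# Assumptions: entities are given as strings, hallucinated entities are already
# flagged, and a generic BERT NSP head stands in for the in-domain model.
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

def faithful_adjusted_recall(reference_entities, hallucinated_entities, output_text):
    """Fraction of non-hallucinated reference entities that appear in the output."""
    kept = [e for e in reference_entities if e not in hallucinated_entities]
    if not kept:
        return 0.0  # edge-case choice for illustration
    found = sum(1 for e in kept if e.lower() in output_text.lower())
    return found / len(kept)

def nsp_coherence(sentences, model_name="bert-base-uncased"):
    """Average next-sentence-prediction probability over adjacent sentence pairs."""
    tok = BertTokenizer.from_pretrained(model_name)
    model = BertForNextSentencePrediction.from_pretrained(model_name)
    probs = []
    for a, b in zip(sentences, sentences[1:]):
        enc = tok(a, b, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits
        # Index 0 of BERT's NSP head is the "is next sentence" class.
        probs.append(torch.softmax(logits, dim=-1)[0, 0].item())
    return sum(probs) / len(probs) if probs else 0.0
```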
“…Quantitative comparisons among entailment-based and QA-based metrics, however, often differ in their choices of baseline model and input granularity, evaluating on single datasets and drawing differing conclusions as to the best paradigm. For example, some work reports entailment-based metrics as performing best (Koto et al, 2020; Maynez et al, 2020), while other work argues for QA metrics (Durmus et al, 2020; Wang et al, 2020b; Scialom et al, 2021). Recently, Laban et al (2021) proposed a benchmark called SummaC to compare metrics across six factual consistency datasets for the task of binary factual consistency classification, whether a summary is entirely factually consistent or not.…”
Section: Entailment Matrix
confidence: 99%
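A minimal sketch of the sentence-level entailment-matrix idea that benchmarks like SummaC build on: score every (source sentence, summary sentence) pair with an NLI model, aggregate per summary sentence, and average. The roberta-large-mnli checkpoint and the max/mean aggregation are assumptions for illustration, not the exact SummaC procedure.

```python
# Hedged sketch of a sentence-level entailment matrix for factual consistency.
# Assumptions: roberta-large-mnli as the NLI model; max over source sentences
# per summary sentence, then mean over summary sentences; non-empty inputs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def entailment_matrix_score(source_sents, summary_sents,
                            model_name="roberta-large-mnli"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    # roberta-large-mnli maps ENTAILMENT to class index 2; fall back to 2.
    entail_idx = model.config.label2id.get("ENTAILMENT", 2)
    per_summary = []
    for hyp in summary_sents:
        probs = []
        for prem in source_sents:
            enc = tok(prem, hyp, return_tensors="pt", truncation=True)
            with torch.no_grad():
                logits = model(**enc).logits
            probs.append(torch.softmax(logits, dim=-1)[0, entail_idx].item())
        per_summary.append(max(probs))  # best-supporting source sentence
    return sum(per_summary) / len(per_summary)
```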
“…Evaluating Factual Consistency Within entailment-based factual consistency evaluation, Falke et al (2019) propose the task of ranking summary pairs for factual consistency based on entailment models, while Kryscinski et al (2020) explore factual consistency classification jointly with source support or contradiction span extraction. Other work on entailment-based metrics has examined input granularity (Goyal and Durrett, 2020), trained on adversarial datasets (Barrantes et al, 2020), and explored entailment-based models as the backbone of other metrics such as BERTScore (Zhang et al, 2020b), as in Koto et al (2021). Metric comparisons, however, were often conducted on isolated datasets.…”
Section: Related Work
confidence: 99%
“…Recently, metrics have been proposed for evaluating factual consistency, including applying natural language inference (Falke et al, 2019; Kryscinski et al, 2020) and question-answering models (Eyal et al, 2019; Scialom et al, 2019; Durmus et al, 2020; Wang et al, 2020). However, current metrics still do not correlate highly with human judgments on factual consistency (Koto et al, 2020; Pagnoni et al, 2021). To overcome the inherent limitation of automatic metrics, researchers typically crowdsource human evaluations using platforms such as Amazon's Mechanical Turk (MTurk) (Gillick and Liu, 2010; Sabou et al, 2012; Lloret et al, 2013).…”
Section: Introduction
confidence: 99%