Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
DOI: 10.18653/v1/2022.naacl-main.415
CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning

Abstract: Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained neural language models, substantial amounts of hallucinated content are found during human evaluation. In this work, we first devised a typology of factual errors to better understand the types of hallucinations generated by current models, and conducted a human evaluation on a popular dialogue summarization dataset. We…

Cited by 17 publications (22 citation statements). References 19 publications.
“…Factual consistency. As mentioned in Section 2, information inconsistency (Kryscinski et al., 2019, 2020) is a common problem of general text summarization systems, especially in the meeting domain (Tang et al., 2022). This suggests that future research should focus on dealing with hallucinated content in generated meeting summaries.…”
Section: Future Directions
confidence: 96%
“…between a summary and its source (Huang et al., 2021). It is reported that nearly 30% of summaries generated by neural seq2seq models suffer from fact fabrication (Cao et al., 2018), and in the dialogue domain, most factual errors are related to dialogue flow modeling, informal interactions between speakers, and complex coreference resolution (Tang et al., 2022). Given the special characteristics of dialogues, more studies will be needed to develop more appropriate metrics for dialogue summarization (Zechner and Waibel, 2000).…”
Section: [Problems]
confidence: 99%
“…Contrastive learning for faithfulness has been applied to fine-tuning (Nan et al., 2021b; Tang et al., 2022; Cao and Wang, 2021a), post-hoc editing (Cao et al., 2020; Zhu et al., 2021), re-ranking (Chen et al., 2021), and evaluation (Kryscinski et al., 2020; Deng et al., 2021a). This line of research has largely focused on the methods used to generate synthetic errors for negative contrast sets, i.e., by directly mimicking errors observed during human evaluation (Tang et al., 2022), entity swapping (Cao and Wang, 2021a), language model infilling (Cao and Wang, 2021a), or using unfaithful system outputs (Nan et al., 2021b). Orthogonal to our work, Cao and Wang (2021a) assess the relative efficacy of a diverse set of corruption methods when used for contrastive fine-tuning for faithfulness.…”
Section: Related Work
confidence: 99%
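The entity-swapping corruption mentioned in the statement above is straightforward to illustrate. Below is a minimal sketch, assuming a spaCy NER pipeline; the function name and the swap policy are illustrative assumptions, not the cited authors' exact implementation.

```python
# Minimal sketch of entity swapping to build a negative contrast set,
# one of the corruption methods for contrastive fine-tuning cited above.
# The pipeline choice and swap policy are assumptions for illustration.
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English NER pipeline

def entity_swap(summary: str, seed: int = 0) -> str:
    """Corrupt a faithful summary by swapping two entities of the same
    type, producing a fluent but unfaithful negative example."""
    rng = random.Random(seed)
    doc = nlp(summary)
    # Group entity mentions by label (PERSON, ORG, DATE, ...).
    by_label: dict[str, list[str]] = {}
    for ent in doc.ents:
        by_label.setdefault(ent.label_, []).append(ent.text)
    # Find a label with at least two distinct surface forms to swap.
    for label, mentions in by_label.items():
        texts = sorted(set(mentions))
        if len(texts) >= 2:
            a, b = rng.sample(texts, 2)
            # Swap the two mentions everywhere via a placeholder.
            return (summary.replace(a, "\x00")
                           .replace(b, a)
                           .replace("\x00", b))
    return summary  # no swappable entities; return unchanged

positive = "Amanda will call Jerry after the meeting on Friday."
negative = entity_swap(positive)
print(negative)  # e.g., "Jerry will call Amanda after the meeting on Friday."
```

The swapped output preserves fluency while contradicting the source, which is what makes it a useful hard negative for contrastive objectives.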
“…The SAMSum corpus is a large-scale dialogue summarization dataset that contains 16k English daily conversations with corresponding summaries written by linguists. We use the human annotations of SAMSum summaries in ConFiT (Tang et al., 2022) as our meta-evaluation dataset, in which summaries generated by six summarization models were rated for faithfulness on a scale of 1-10. We refer to this dataset as MetaSAMSum.…”
Section: Metrics and Data
confidence: 99%
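To make the meta-evaluation setup concrete, here is a hedged sketch of how an automatic faithfulness metric could be scored against 1-10 human ratings of the kind described above; the record schema and toy values are hypothetical, not MetaSAMSum's actual format.

```python
# Sketch of meta-evaluating a faithfulness metric against human ratings.
# Field names and toy records are hypothetical placeholders.
from scipy.stats import pearsonr, spearmanr

# Each record pairs a system summary's metric score with its human rating.
records = [
    {"metric_score": 0.91, "human_faithfulness": 9},
    {"metric_score": 0.74, "human_faithfulness": 6},
    {"metric_score": 0.32, "human_faithfulness": 2},
    {"metric_score": 0.58, "human_faithfulness": 5},
]

metric = [r["metric_score"] for r in records]
human = [r["human_faithfulness"] for r in records]

# Correlation with human judgments is the standard criterion for
# comparing faithfulness metrics on a meta-evaluation set.
print("Pearson: ", pearsonr(metric, human)[0])
print("Spearman:", spearmanr(metric, human)[0])
```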
“…Kryscinski et al. (2020) found that up to 30% of generated summaries are affected by factual inconsistencies. Tang et al. (2022) studied the types of factual errors produced by current models on a popular dialogue summarization dataset and revealed hallucination issues. Having metrics that can reliably identify hallucinations and source-contradicting information is therefore a critical step in summarization research.…”
Section: Introduction
confidence: 99%