2022
DOI: 10.1609/aaai.v36i10.21299

InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation

Abstract: Assessing the quality of natural language generation (NLG) systems through human annotation is very expensive. Additionally, human annotation campaigns are time-consuming and involve non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy for quality. In the last decade, many string-based metrics (e.g., BLEU or ROUGE) have been introduced. However, such metrics usually rely on exact matches and thus do not robustly handle synonyms. In this paper, we introduce InfoLM, a family of…
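To make the idea concrete, here is a minimal sketch of an InfoLM-style score, not the authors' implementation: each token of a sentence is masked in turn, the masked language model's predicted distribution over the vocabulary is collected, the per-position distributions are averaged into one distribution per sentence, and candidate and reference are compared with an information measure. The model choice (bert-base-uncased), the uniform averaging, and the use of KL divergence are assumptions for illustration; the paper studies a whole family of such measures.

# Sketch of an InfoLM-style metric (illustrative, not the authors' code).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-uncased"  # assumption: any masked LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

def vocab_distribution(text: str) -> torch.Tensor:
    """Mask each token position in turn, collect the MLM's predictive
    distribution at that position, and average them into a single
    distribution over the vocabulary."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    dists = []
    for pos in range(1, ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        dists.append(torch.softmax(logits, dim=-1))
    return torch.stack(dists).mean(dim=0)

def infolm_like(candidate: str, reference: str) -> float:
    """KL(reference || candidate) between the two vocabulary distributions;
    KL is one member of the family of measures the paper considers."""
    p = vocab_distribution(reference)
    q = vocab_distribution(candidate)
    eps = 1e-12  # numerical floor to keep the logs finite
    return torch.sum(p * (p.add(eps).log() - q.add(eps).log())).item()

print(infolm_like("the cat sat on the rug", "a cat was sitting on the mat"))

Because the language model spreads probability mass over synonyms and related tokens, two paraphrases yield nearby distributions even with little exact n-gram overlap, which is precisely the weakness of string-based metrics that the abstract points out.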

Cited by 8 publications (5 citation statements)
References 42 publications
“…Hashimoto et al (2019) used the same foundation to combine human and automatic evaluation in capturing the trade-off between sampling diverse outputs and achieving the highest possible quality. Pillutla et al (2021) and Colombo et al (2022) expand on these insights and a framework by Djolonga et al (2020) to compare the human- and model-distributions by measuring the extent to which they diverge. A similar approach based on information theory estimates the extent to which a generated summary helps reconstruct the article on which the summary is based (Egan et al, 2022).…”
Section: The Status Quo
confidence: 99%
“…However, they cannot compare two strings based on synonyms. InfoLM overcomes this drawback by using a pre-trained masked language model, requiring no additional training, to compute similarity scores between summaries and references as discrete probability distributions over the vocabulary [96].…”
Section: Summarization Evaluation Metrics
confidence: 99%
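As a usage note for the mechanism this statement describes: TorchMetrics ships an InfoLM implementation, and, assuming a recent version of that library, a call looks roughly like the sketch below. The small BERT checkpoint and the sentence pair are illustrative, and idf=False keeps the example self-contained by skipping corpus-level IDF statistics.

# Hedged usage sketch; interface per recent TorchMetrics versions.
from torchmetrics.text.infolm import InfoLM

infolm = InfoLM("google/bert_uncased_L-2_H-128_A-2", idf=False)
preds = ["he read the book because he was interested in world history"]
target = ["he was interested in world history because he read the book"]
print(infolm(preds, target))  # lower divergence means closer distributions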
“…For future work, we plan to study OOD in sequence labelling tasks (Witon* et al, 2018; Colombo* et al, 2020; Chapuis* et al, 2020a; Colombo et al, 2021a), sequence generation (Colombo* et al, 2019; Jalalzai* et al, 2020; Modi et al, 2020; Colombo et al, 2021e), fair classification (Colombo et al, 2021d; Pichler et al, 2022), and multimodal scenarios (Garcia* et al, 2019; Dinkar* et al, 2020), as well as automatic evaluation (Colombo et al, 2021c; Colombo, 2021a; Staerman et al, 2021b).…”
Section: G Future Applications
confidence: 99%