Thibault Sellam scite author profile

Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.


BLEURT: Learning Robust Metrics for Text Generation

Sellam, Das, Parikh (2020). Preprint.


The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

Gehrmann, Adewumi, Aggarwal et al. (2021)


We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for which we are organizing a shared task at our ACL 2021 Workshop and to which we invite the entire NLG community to participate.


The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

Gehrmann, Adewumi, Aggarwal et al. (2021). Preprint.


Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation

Tian, Narayan, Sellam et al. (2019). Preprint.


Neural conditional text generation systems have achieved significant progress in recent years, showing the ability to produce highly fluent text. However, the inherent lack of controllability in these systems allows them to hallucinate factually incorrect phrases that are unfaithful to the source, making them often unsuitable for many real-world systems that require high degrees of precision. In this work, we propose a novel confidence-oriented decoder that assigns a confidence score to each target position. This score is learned in training using a variational Bayes objective, and can be leveraged at inference time using a calibration technique to promote more faithful generation. Experiments on a structured data-to-text dataset, WikiBio (Lebret et al., 2016), show that our approach is more faithful to the source than existing state-of-the-art approaches, according to both automatic metrics and human evaluation.

