Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 2021
DOI: 10.18653/v1/2021.eacl-main.25
Evaluating the Evaluation of Diversity in Natural Language Generation

Abstract: Despite growing interest in natural language generation (NLG) models that produce diverse outputs, there is currently no principled method for evaluating the diversity of an NLG system. In this work, we propose a framework for evaluating diversity metrics. The framework measures the correlation between a proposed diversity metric and a diversity parameter, a single parameter that controls some aspect of diversity in generated text. For example, a diversity parameter might be a binary variable used to instruct …
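A minimal sketch of the evaluation idea described in the abstract, assuming a hypothetical setup where generations are produced at several values of a diversity parameter (here, decoding temperature) and distinct-n is the candidate metric; the response sets below are illustrative placeholders, not data from the paper. A metric that genuinely measures diversity should correlate strongly with the parameter.

```python
from collections import Counter
from scipy.stats import spearmanr

def distinct_n(texts, n=2):
    """Fraction of unique n-grams over all n-grams in a set of generations."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

# Hypothetical response sets keyed by the diversity parameter (decoding temperature).
# In practice these would come from sampling an NLG model at each parameter value.
response_sets = {
    0.3: ["the cat sat on the mat", "the cat sat on the mat", "the cat sat on a mat"],
    0.7: ["the cat sat on the mat", "a dog slept near the door", "the cat sat on a rug"],
    1.0: ["a dog slept near the door", "birds sang outside the window", "the cat chased a red ball"],
}

params = sorted(response_sets)
scores = [distinct_n(response_sets[p], n=2) for p in params]

# A good diversity metric should track the parameter, i.e. show high rank correlation.
rho, _ = spearmanr(params, scores)
print(f"distinct-2 scores: {scores}")
print(f"Spearman correlation with diversity parameter: {rho:.2f}")
```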

Cited by 18 publications (33 citation statements)
References 26 publications (14 reference statements)
“…We use sent-BERT as an output diversity metric by using the cosine distance instead of cosine similarity. Our motivation in choosing these diversity metrics is from Tevet and Berant (2020), who identify dist-n and sent-BERT as the best metrics to evaluate two targeted types of diversity: diverse word choice and diverse content, respectively.…”
Section: Discussion (mentioning)
confidence: 99%
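A sketch of the two metrics this excerpt refers to, under the assumption that "sent-BERT diversity" means the average pairwise cosine distance between Sentence-BERT embeddings of the generated outputs; the checkpoint name below is just one publicly available sentence-transformers model, not necessarily the one used in the cited work.

```python
import itertools

import numpy as np
from sentence_transformers import SentenceTransformer

def dist_n(texts, n=1):
    """dist-n: ratio of unique n-grams to total n-grams (lexical / word-choice diversity)."""
    ngrams = [tuple(t.split()[i:i + n])
              for t in texts
              for i in range(len(t.split()) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def sent_bert_diversity(texts, model_name="all-MiniLM-L6-v2"):
    """Content diversity: mean pairwise cosine *distance* between sentence embeddings."""
    model = SentenceTransformer(model_name)
    emb = model.encode(texts, normalize_embeddings=True)  # unit-length vectors
    distances = [1.0 - float(np.dot(emb[i], emb[j]))
                 for i, j in itertools.combinations(range(len(texts)), 2)]
    return sum(distances) / len(distances)

outputs = [
    "The weather today is sunny and warm.",
    "It is a bright, warm day outside.",
    "The stock market fell sharply this morning.",
]
print("dist-1:", dist_n(outputs, n=1))                        # word-choice diversity
print("sent-BERT diversity:", sent_bert_diversity(outputs))   # content diversity
```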
“…2) Reference-free metrics directly evaluate the quality of generated texts without references. Since unsupervised metrics like perplexity (Brown et al., 1992) and distinct n-grams (Li et al., 2016) can only provide a task-agnostic result which correlates weakly with human judgments (Hashimoto et al., 2019; Tevet and Berant, 2021), most of the reference-free metrics resort to supervised models. Specifically, they are trained to fit human-annotated ratings / labels (such as discriminator scores (Shen et al., 2017)) or distinguish human-written texts from negative samples (such as UNION (Guan and Huang, 2020)).…”
Section: Evaluation Metric For Text Generation (mentioning)
confidence: 99%
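For context, a minimal sketch of the perplexity-style unsupervised scoring this excerpt mentions, using a small pretrained GPT-2 from Hugging Face transformers as the scoring model (an illustrative choice, not the setup of the cited papers); lower perplexity only means the text is more probable under the language model, which, as noted above, says little about diversity or task quality.

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2: exp of the mean token-level negative log-likelihood."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the average cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity("The cat sat on the mat."))
print(perplexity("Mat the on sat cat the."))  # scrambled word order -> higher perplexity
```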
“…In spite of growing interest in NLG models that produce diverse outputs, there is currently no principled neural method for evaluating the diversity of an NLG system. As described in Tevet and Berant (2021), existing automatic diversity metrics (e.g. Self-BLEU) perform worse than humans on the task of estimating content diversity, indicating a low correlation between metrics and human judgments.…”
Section: Future Directions (mentioning)
confidence: 99%
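A sketch of the Self-BLEU metric named in this excerpt, assuming the common formulation in which each generated text is scored with BLEU against all other generations as references and the scores are averaged; higher Self-BLEU means the outputs are more similar to one another, i.e. less diverse.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def self_bleu(texts, weights=(0.25, 0.25, 0.25, 0.25)):
    """Average BLEU of each generation against the remaining generations as references."""
    smoother = SmoothingFunction().method1
    tokenized = [t.split() for t in texts]
    scores = []
    for i, hypothesis in enumerate(tokenized):
        references = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(references, hypothesis,
                                    weights=weights,
                                    smoothing_function=smoother))
    return sum(scores) / len(scores)

low_diversity = ["the cat sat on the mat"] * 3
high_diversity = ["the cat sat on the mat",
                  "a storm rolled in over the hills",
                  "she ordered two coffees and a croissant"]

print("Self-BLEU (repetitive set):", round(self_bleu(low_diversity), 3))   # close to 1.0
print("Self-BLEU (varied set):   ", round(self_bleu(high_diversity), 3))  # much lower
```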
“…An important desideratum of natural language generation (NLG) is to produce outputs that are not only correct but also diverse (Tevet and Berant, 2021). The term "diversity" in NLG is defined as the ability of a generative model to create a set of possible outputs that are each valid given the input and vary as widely as possible in terms of content, language style, and word variability (Gupta et al., 2018).…”
Section: Introduction (mentioning)
confidence: 99%