Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1409
Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

Abstract: We conduct a large-scale, systematic study to evaluate the existing evaluation methods for natural language generation in the context of generating online product reviews. We compare human-based evaluators with a variety of automated evaluation procedures, including discriminative evaluators that measure how well machine-generated text can be distinguished from human-written text, as well as word overlap metrics that assess how similar the generated text is to human-written references. We determine to wh…
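The abstract names two families of automated evaluators. As a rough illustration of the word-overlap family, the sketch below computes BLEU with NLTK on invented review snippets; the texts and the smoothing choice are assumptions for the example, not details from the paper.

```python
# Hedged sketch of a word-overlap metric (BLEU) of the kind the study
# compares against human evaluators. All texts here are invented examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

generated = "the battery life of this phone is great".split()
references = [
    "this phone has great battery life".split(),
    "battery life on this phone is excellent".split(),
]

# Smoothing avoids zero scores when short texts miss higher-order n-grams.
score = sentence_bleu(references, generated,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```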

Cited by 19 publications (21 citation statements). References 64 publications.
“…Released by Google, Google LM is a pre-trained language model trained on a billion-word corpus, a publicly available dataset containing mainly news data [95], [96]. It is based on a two-layer LSTM with 8192 units in each layer [97], [98].…”
Section: Machine Learning Algorithms for Text Generation
Citation type: mentioning (confidence: 99%)
“…Creative texts, such as stories, are less constrained than translated texts, but researchers continue to employ crowd workers to evaluate creative texts, often without evaluating reference texts (see Section 2). Previous studies have asked workers to choose between (Mori et al., 2019) or distinguish between human-written and machine-generated texts (Garbacea et al., 2019; Ippolito et al., 2020; Clark et al., 2021).…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
“…Recently, there have been many criticisms of existing metrics. Garbacea et al. (2019) showed the poor generalization of discriminator-based metrics. Sai et al. (2019) demonstrated that ADEM is not robust to simple attacks such as word substitution or random word shuffling.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
“…It is extremely important for learnable metrics to deal with model drift and dataset drift (Garbacea et al., 2019; Sellam et al., 2020). Specifically, a generalizable metric should be able to evaluate different NLG models, since the generation quality or inductive bias can vary significantly across models.…”
Section: Generalization Ability
Citation type: mentioning (confidence: 99%)