Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

Gârbacea, Cristina; Carton, Samuel; Yan, Shiyan; Mei, Qiaozhu

doi:10.18653/v1/d19-1409

Cited by 19 publications

(21 citation statements)

References 64 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Released by Google, Google LM is a language pre-trained model trained on a billion-word corpus, a publicly available dataset containing mainly news data [95], [96]. It is based on a two-layer LSTM with 8192 units in each layer, [97], [98].…”

Section: Machine Learning Algorithms For Text Generationmentioning

confidence: 99%

Adversarial Machine Learning in Text Processing: A Literature Survey

et al. 2022

View full text Add to dashboard Cite

Machine learning algorithms represent the intelligence that controls many information systems and applications around us. As such, they are targeted by attackers to impact their decisions. Text created by machine learning algorithms has many types of applications, some of which can be considered malicious especially if there is an intention to present machine-generated text as human-generated. In this paper, we surveyed major subjects in adversarial machine learning for text processing applications. Unlike adversarial machine learning in images, text problems and applications are heterogeneous. Thus, each problem can have its own challenges. We focused on some of the evolving research areas such as: malicious versus genuine text generation metrics, defense against adversarial attacks, and text generation models and algorithms. Our study showed that as applications of text generation will continue to grow in the near future, the type and nature of attacks on those applications and their machine learning algorithms will continue to grow as well. Literature survey indicated an increasing trend in using pre-trained models in machine learning. Word/sentence embedding models and transformers are examples of those pre-trained models. Adversarial models may utilize same or similar pre-trained models as well. In another trend related to text generation models, literature showed effort to develop universal text perturbations to be used in both black-and whitebox attack settings. Literature showed also using conditional GANs to create latent representation for writing types. This usage will allow for a seamless lexical and grammatical transition between various writing styles. In text generation metrics, research trends showed developing successful automated or semi-automated assessment metrics that may include human judgement. Literature showed also research trends of designing and developing new memory models that increase performance and memory utilization efficiency without validating real-time constraints. Many research efforts evaluate different defense model approaches and algorithms. Researchers evaluated different types of targeted attacks, and methods to distinguish human versus machine generated text.

show abstract

Section: Machine Learning Algorithms For Text Generationmentioning

confidence: 99%

Adversarial Machine Learning in Text Processing: A Literature Survey

et al. 2022

View full text Add to dashboard Cite

show abstract

“…Creative texts, such as stories, are less constrained than translated texts, but researchers continue to employ crowd workers to evaluate creative texts, often without evaluating reference texts (see Section 2). Previous studies have asked workers to choose from (Mori et al, 2019) or distinguish between human-written and machine-generated texts (Garbacea et al, 2019;Ippolito et al, 2020;Clark et al, 2021).…”

Section: Related Workmentioning

confidence: 99%

The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

Karpinska¹,

Akoury²,

Iyyer³

2021

Preprint

View full text Add to dashboard Cite

Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the vast majority of them fail to report crucial details about their AMT tasks, hindering reproducibility. We then run a series of story evaluation experiments with both AMT workers and English teachers and discover that even with strict qualification filters, AMT workers (unlike teachers) fail to distinguish between model-generated text and human-generated references. We show that AMT worker judgments improve when they are shown model-generated output alongside human-generated references, which enables the workers to better calibrate their ratings. Finally, interviews with the English teachers provide deeper insights into the challenges of the evaluation process, particularly when rating model-generated text.

show abstract

“…Recently, there have been many criticisms for existing metrics. Garbacea et al (2019) showed the poor generalization of discriminator-based metrics. Sai et al (2019) demonstrated ADEM is not robust to simple attacks such as simple word substitution or random word shuffle.…”

Section: Related Workmentioning

confidence: 99%

“…It is extremely important for learnable metrics to deal with model drift and dataset drift (Garbacea et al, 2019;Sellam et al, 2020). Specifically, a generalizable metric should be able to evaluate dif-ferent NLG models since the generation quality or inductive bias can vary significantly across models.…”

Section: Generalization Abilitymentioning

confidence: 99%

UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation

Guan¹,

Huang²

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

View full text Add to dashboard Cite

Despite the success of existing referenced metrics (e.g., BLEU and MoverScore), they correlate poorly with human judgments for openended text generation including story or dialog generation because of the notorious oneto-many issue: there are many plausible outputs for the same input, which may differ substantially in literal or semantics from the limited number of given references. To alleviate this issue, we propose UNION, a learnable UNreferenced metrIc for evaluating Open-eNded story generation, which measures the quality of a generated story without any reference. Built on top of BERT, UNION is trained to distinguish human-written stories from negative samples and recover the perturbation in negative stories. We propose an approach of constructing negative samples by mimicking the errors commonly observed in existing NLG models, including repeated plots, conflicting logic, and long-range incoherence. Experiments on two story datasets demonstrate that UNION is a reliable measure for evaluating the quality of generated stories, which correlates better with human judgments and is more generalizable than existing state-of-theart metrics. * Corresponding author Leading ContextJack was at the bar. Reference By HumanHe noticed a phone on the floor. He was going to take it to lost and found. But it started ringing on the way. Jack answered it and returned it to the owner's friends.Sample 1 (Reasonable, B=0.29, M=0.49, U=1.00) On the way out he noticed a phone on the floor. He asked around if anybody owned it. Eventually he gave it to the bartender. They put it into their lost and found box.Sample 2 (Reasonable, B=0.14, M=0.27, U=1.00) He had a drinking problem. He kept having more beers. After a while he passed out. When he waked up, he was surprised to find that he lost over a hundred dollars.Sample 3 (Unreasonable, B=0.20, M=0.35, U=0.00) He was going to get drunk and get drunk. The bartender told him it was already time to leave. Jack started drinking. Jack wound up returning but cops came on the way home.

show abstract

Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

Cited by 19 publications

References 64 publications

Adversarial Machine Learning in Text Processing: A Literature Survey

Adversarial Machine Learning in Text Processing: A Literature Survey

The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation

Contact Info

Product

Resources

About