Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)
DOI: 10.18653/v1/2021.acl-long.565
All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text

Abstract: Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore three approaches for quick…

Cited by 132 publications (142 citation statements). References 28 publications.
“…Large-scale deep neural network models have an extraordinary capacity to generate linguistic continuations of natural language prompts (5, 8). The models provide the probability of words given a context captured by the preceding sentences, similar to human predictions (14).…”
Section: Discussion (mentioning)
confidence: 99%
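The statement above refers to language models assigning probabilities to upcoming words given the preceding context. As an illustration only (not taken from the cited studies), a minimal sketch of reading off such a next-word distribution from GPT-2, assuming the Hugging Face transformers and torch packages are available:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load a small pretrained language model and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Preceding context; the prompt text is a made-up example.
context = "The chef took the cake out of the"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the token that follows the context.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs.tolist(), top_ids.tolist()):
    print(f"{tokenizer.decode([i])!r}: {p:.3f}")
```

This only demonstrates the quantity being discussed (a model's conditional word probabilities); the cited works compare such distributions to human predictions using their own materials and protocols.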
“…Creative texts, such as stories, are less constrained than translated texts, but researchers continue to employ crowd workers to evaluate creative texts, often without evaluating reference texts (see Section 2). Previous studies have asked workers to choose from (Mori et al., 2019) or distinguish between human-written and machine-generated texts (Garbacea et al., 2019; Ippolito et al., 2020; Clark et al., 2021).…”
Section: Related Work (mentioning)
confidence: 99%
“…All those studies focus on asking (crowdsourced) human annotators to decide whether a text was generated by a machine or a human. Clark et al. (2021) point out that the high fluency of modern generation models, combined with generally low expectations of what machines can accomplish, makes this distinction hard to draw, even for lightly trained annotators.…”
Section: Related Work (mentioning)
confidence: 99%