2021
DOI: 10.48550/arxiv.2109.06835
Preprint

The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

Abstract: Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the vast majority o…

Cited by 3 publications (3 citation statements)
References 32 publications
“…Finally, we rate all methods on human evaluation. We follow recent work on good evaluation practices for text generation approaches (Karpinska et al, 2021). Further details are in Appendix A.4.…”
Section: Human Evaluation
Mentioning confidence: 99%
“…The low values of agreement between the raters are not surprising; indeed, it is a common issue when performing subjective annotations in the context of social computing studies (Salminen, Al-Merekhi, Dey, & Jansen, 2018) or when rating emotion databases (Siegert, Böck, & Wendemuth, 2014), especially when using crowdsourcing (Karpinska, Akoury, & Iyyer, 2021). Inspired by suggestions in (Siegert et al, 2014) and (Karpinska et al, 2021), we limited the risk of high variance by providing context information and by carefully checking the French-speaking requirement and the time spent to complete the annotation task.…”
Section: Scores
Mentioning confidence: 99%
“…In human evaluation processes, relying on a single perspective can introduce bias and instability into the results (Karpinska et al, 2021). Recognizing this, best practices often involve multiple human annotators collaborating on the evaluation (Van Der Lee et al, 2019).…”
Section: Introduction
Mentioning confidence: 99%