2021
DOI: 10.48550/arxiv.2109.06835
Preprint

The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

Abstract: Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the vast majority o…

Cited by 3 publications (3 citation statements)
References 32 publications
“…Finally, we rate all methods on human evaluation. We follow recent work on good evaluation practices for text generation approaches (Karpinska et al, 2021). Further details are in Appendix A.4.…”
Section: Human Evaluation
Mentioning confidence: 99%
“…The low values of agreement between the raters are not surprising; indeed, it is a common issue when performing subjective annotations in the context of social computing studies (Salminen, Al-Merekhi, Dey, & Jansen, 2018) or when rating emotion databases (Siegert, Böck, & Wendemuth, 2014), especially when using crowdsourcing (Karpinska, Akoury, & Iyyer, 2021). Inspired by suggestions in (Siegert et al, 2014) and (Karpinska et al, 2021), we limited the risk of high variance by providing context information and by carefully checking the French-speaking requirement and the time spent to complete the annotation task.…”
Section: Scores
Mentioning confidence: 99%
“…In human evaluation processes, relying on a single perspective can introduce bias and instability into the results (Karpinska et al, 2021). Recognizing this, best practices often involve multiple human annotators collaborating on the evaluation (Van Der Lee et al, 2019).…”
Section: Introduction
Mentioning confidence: 99%