Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.525
STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation

Abstract: Systems for story generation are asked to produce plausible and enjoyable stories given an input context. This task is underspecified, as a vast number of diverse stories can originate from a single input. The large output space makes it difficult to build and evaluate story generation models, as (1) existing datasets lack rich enough contexts to meaningfully guide models, and (2) existing evaluations (both crowdsourced and automatic) are unreliable for assessing long-form creative text. To address these issues…

Cited by 54 publications (51 citation statements). References 28 publications.
“…Recent work has also found conducting human evaluation for long-form generation to be challenging, for example in the context of question answering and story generation (Akoury et al., 2020). Our observations for data-to-text generation complement theirs and we hope that our dataset can inspire future research on human evaluation for long-form text generation.…”
Section: Methods (supporting)
confidence: 65%
“…For example, TuringAdvice (Zellers et al., 2021) asks evaluators to rate NLG models by their ability to generate helpful advice, and RoFT (Dugan et al., 2020) engages evaluators through a guessing game to determine the boundary between human- and machine-generated text. Other evaluation methods ask the evaluators to directly interact with the generated text; for example, Choose Your Own Adventure (Clark and Smith, 2021) and Storium (Akoury et al., 2020) evaluate story generation models by having people write stories with the help of generated text. We see that GPT3 can successfully mimic human-authored text across several domains, renewing the importance of evaluations that push beyond surface-level notions of quality and consider whether a text is helpful in a downstream setting or has attributes that people would want from machine-generated text.…”
Section: Recommendations (mentioning)
confidence: 99%
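The machine-in-the-loop evaluations cited above score a generation system by how authors actually use its suggestions while writing, rather than by asking raters for isolated quality judgments. As a rough illustration only (not the metric used by STORIUM or Choose Your Own Adventure), the sketch below computes the fraction of machine-generated tokens an author keeps when editing a suggestion into a published story entry; the function name, whitespace tokenization, and use of difflib.SequenceMatcher are all assumptions made for this example.

from difflib import SequenceMatcher

def edit_retention(generated: str, published: str) -> float:
    # Fraction of generated tokens that survive, in order, into the
    # author-edited published text (illustrative; whitespace tokenization).
    gen_tokens = generated.split()
    pub_tokens = published.split()
    if not gen_tokens:
        return 0.0
    matcher = SequenceMatcher(a=gen_tokens, b=pub_tokens, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(gen_tokens)

# Example: the author keeps part of the suggestion and rewrites the rest.
generated = "The knight crept into the ruined tower and lit a torch"
published = "The knight crept into the ruined tower and waited in the dark"
print(round(edit_retention(generated, published), 2))

Because a platform like STORIUM records the author's published edit alongside each model suggestion, retention-style signals of this kind can be computed automatically; the exact formulation above is only a stand-in for whatever edit-based metric a given platform defines.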
“…In addition to GENIE, multiple other related efforts exist that work toward the goal of reproducible and robust in-depth human and automatic evaluation for NLG tasks, and which focus on specific modeling- or task-aspects that are different from those in GEM. Among those are KILT (Petroni et al., 2020), which focuses on knowledge-intensive tasks and retrieval-based models, Storium (Akoury et al., 2020), which focuses on open-ended story generation, and BIG bench, which focuses on measuring few-shot and zero-shot capabilities of language models.…”
Section: Increasing Multilingualism of NLG Research (mentioning)
confidence: 99%