Proceedings of the Seventh Joint Conference on Lexical And Computational Semantics 2018
DOI: 10.18653/v1/s18-2024
Quality Signals in Generated Stories

Abstract: We study the problem of measuring the quality of automatically-generated stories. We focus on the setting in which a few sentences of a story are provided and the task is to generate the next sentence ("continuation") of the story. We seek to identify what makes a story continuation interesting, relevant, and of high overall quality. We crowdsource annotations along these three criteria for the outputs of story continuation systems, design features, and train models to predict the annotations. Our trained sc…
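The abstract (truncated above) describes a feature-based scorer trained on crowdsourced annotations of continuations. A minimal sketch of that kind of pipeline, with entirely hypothetical features and toy data rather than the paper's actual feature set, might look like:

    # Hypothetical sketch of the setup described in the abstract:
    # hand-designed features over (context, continuation) pairs, regressed
    # against crowdsourced quality scores. Features and data are illustrative.
    from sklearn.linear_model import Ridge
    import numpy as np

    def features(context, continuation):
        ctx_words = set(context.lower().split())
        cont_words = continuation.lower().split()
        overlap = sum(w in ctx_words for w in cont_words) / max(len(cont_words), 1)
        return [len(cont_words),                                 # continuation length
                overlap,                                          # lexical overlap with context (relevance proxy)
                len(set(cont_words)) / max(len(cont_words), 1)]   # type/token ratio (diversity proxy)

    # Toy training data: (context, continuation, crowd score in [1, 5]).
    data = [
        ("The knight rode into the forest.", "He soon lost his way among the trees.", 4.2),
        ("The knight rode into the forest.", "Bananas are yellow.", 1.3),
    ]
    X = np.array([features(c, s) for c, s, _ in data])
    y = np.array([score for _, _, score in data])
    scorer = Ridge().fit(X, y)
    print(scorer.predict([features("The knight rode into the forest.", "A dragon appeared.")]))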

Cited by 13 publications (10 citation statements). References 34 publications (26 reference statements).
Citation statements (ordered by relevance):
“…The inadequacies of existing human and automatic evaluation methods are a major roadblock for story generation research. Automatic evaluations correlate weakly with human judgments (Sagarkar et al., 2018), and these judgments are obtained from crowd workers who are not invested in the narratives they are assessing. These concerns are magnified with STORIUM, as the story contexts are far too long for crowd workers to reliably evaluate (Section 5).…”
Section: A Machine-in-the-loop Evaluation Platform
“…Finally, our STORIUM evaluation takes a different approach from prior research that measures the quality of generated stories. Sagarkar et al. (2018) train an automatic scorer on human annotations of overall story quality, relevance, and interestingness, based on evaluation criteria from McIntyre and Lapata (2009). See et al. (2019) consider a number of diversity-related measures for the automated evaluation of story generation systems, focusing on the GPT-2 small model and noting that quality assessments are still best measured through human evaluation.…”
Section: Related Work
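The diversity-oriented measures mentioned in this statement (See et al., 2019) include distinct-n style statistics; a minimal sketch of one such measure, assuming naive whitespace tokenization, is:

    # Sketch of a distinct-n diversity statistic for open-ended generation:
    # the ratio of unique n-grams to total n-grams across a set of generated
    # continuations. Whitespace tokenization is an illustrative assumption.
    def distinct_n(generations, n=2):
        ngrams = []
        for text in generations:
            tokens = text.split()
            ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        return len(set(ngrams)) / max(len(ngrams), 1)

    print(distinct_n(["the knight rode on", "the knight rode away"], n=2))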
“…Previous work evaluates generation tasks with automatic metrics, such as perplexity (PPL), BLEU (Papineni et al., 2002), and ROUGE (Lin, 2004). We adopt these in our evaluation and add three more metrics using the pretrained story scorer from Sagarkar et al. (2018). The scorer rates a generated continuation given its context along three dimensions: relevance (R), interestingness (I), and overall quality (O).…”
Section: Discussion
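The standard metrics named in this statement are available in common libraries; a minimal sketch using nltk for BLEU and the rouge-score package for ROUGE-L (the library and smoothing choices are illustrative assumptions, not necessarily what the citing paper used):

    # Sketch of the standard automatic metrics named above: sentence-level
    # BLEU via nltk and ROUGE-L via the rouge-score package.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from rouge_score import rouge_scorer

    reference = "he soon lost his way among the trees"
    hypothesis = "he quickly lost his path in the trees"

    bleu = sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=SmoothingFunction().method1)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, hypothesis)["rougeL"].fmeasure
    print(f"BLEU: {bleu:.3f}  ROUGE-L F1: {rouge_l:.3f}")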
“…Though all have strengths and weaknesses, ROUGE metrics (particularly ROUGE-L) are common for multi-sentence text evaluations. Textual metrics that consider specific qualities in the system outputs, like complexity and diversity, are also used to evaluate NLG systems (Dusek et al., 2019; Hashimoto et al., 2019; Sagarkar et al., 2018; Purdy et al., 2018). Word mover's distance has recently been used for NLP tasks like learning word embeddings (Zhang et al., 2017; Wu et al., 2018), textual entailment (Sulea, 2017), document similarity and classification (Kusner et al., 2015; Huang et al., 2016; Atasu et al., 2017), image captioning (Kilickaya et al., 2017), document retrieval (Balikas et al., 2018), clustering for semantic word-rank (Zhang and Wang, 2018), and as an additional loss for text generation that measures the optimal transport between the generated hypothesis and the reference text (Chen et al., 2019).…”
Section: Related Work
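Word mover's distance, discussed in this statement, is exposed by gensim; a minimal sketch over pretrained GloVe vectors (the embedding choice is an illustrative assumption, and gensim's implementation additionally requires the POT optimal-transport package):

    # Sketch of word mover's distance between a generated hypothesis and a
    # reference, via gensim's wmdistance over pretrained GloVe vectors.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")  # downloads the vectors on first use
    reference = "he soon lost his way among the trees".split()
    hypothesis = "he quickly lost his path in the woods".split()
    print(vectors.wmdistance(reference, hypothesis))  # lower distance = more similar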