Proceedings of the Seventh Joint Conference on Lexical And Computational Semantics 2018
DOI: 10.18653/v1/s18-2024
Quality Signals in Generated Stories

Abstract: We study the problem of measuring the quality of automatically-generated stories. We focus on the setting in which a few sentences of a story are provided and the task is to generate the next sentence ("continuation") of the story. We seek to identify what makes a story continuation interesting, relevant, and of high overall quality. We crowdsource annotations along these three criteria for the outputs of story continuation systems, design features, and train models to predict the annotations. Our trained sc…
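The abstract (truncated above) describes a feature-based scorer trained on crowdsourced annotations of continuations. A minimal sketch of that kind of pipeline, with entirely hypothetical features and toy data rather than the paper's actual feature set, might look like:

    # Hypothetical sketch of the setup described in the abstract:
    # hand-designed features over (context, continuation) pairs, regressed
    # against crowdsourced quality scores. Features and data are illustrative.
    from sklearn.linear_model import Ridge
    import numpy as np

    def features(context, continuation):
        ctx_words = set(context.lower().split())
        cont_words = continuation.lower().split()
        overlap = sum(w in ctx_words for w in cont_words) / max(len(cont_words), 1)
        return [len(cont_words),                                 # continuation length
                overlap,                                          # lexical overlap with context (relevance proxy)
                len(set(cont_words)) / max(len(cont_words), 1)]   # type/token ratio (diversity proxy)

    # Toy training data: (context, continuation, crowd score in [1, 5]).
    data = [
        ("The knight rode into the forest.", "He soon lost his way among the trees.", 4.2),
        ("The knight rode into the forest.", "Bananas are yellow.", 1.3),
    ]
    X = np.array([features(c, s) for c, s, _ in data])
    y = np.array([score for _, _, score in data])
    scorer = Ridge().fit(X, y)
    print(scorer.predict([features("The knight rode into the forest.", "A dragon appeared.")]))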

Cited by 13 publications (10 citation statements). References 34 publications (26 reference statements).
Citation statements (ordered by relevance):
“…The inadequacies of existing human and automatic evaluation methods are a major roadblock for story generation research. Automatic evaluations correlate weakly with human judgments (Sagarkar et al., 2018), and these judgments are obtained from crowd workers who are not invested in the narratives they are assessing. These concerns are magnified with STORIUM, as the story contexts are far too long for crowd workers to reliably evaluate (Section 5).…”
Section: A Machine-in-the-loop Evaluation Platform
“…Finally, our STORIUM evaluation takes a different approach from prior research that measures the quality of generated stories. Sagarkar et al. (2018) train an automatic scorer on human annotations of overall story quality, relevance, and interestingness, based on evaluation criteria from McIntyre and Lapata (2009). See et al. (2019) consider a number of diversity-related measures for the automated evaluation of story generation systems, focusing on the GPT-2 small model and noting that quality assessments are still best measured through human evaluation.…”
Section: Related Work
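The diversity-oriented measures mentioned in this statement (See et al., 2019) include distinct-n style statistics; a minimal sketch of one such measure, assuming naive whitespace tokenization, is:

    # Sketch of a distinct-n diversity statistic for open-ended generation:
    # the ratio of unique n-grams to total n-grams across a set of generated
    # continuations. Whitespace tokenization is an illustrative assumption.
    def distinct_n(generations, n=2):
        ngrams = []
        for text in generations:
            tokens = text.split()
            ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        return len(set(ngrams)) / max(len(ngrams), 1)

    print(distinct_n(["the knight rode on", "the knight rode away"], n=2))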
“…Previous work evaluates generation tasks with automatic metrics, such as perplexity (PPL), BLEU (Papineni et al., 2002), and ROUGE (Lin, 2004). We adopt these in our evaluation and add three more metrics using the pretrained story scorer from Sagarkar et al. (2018). The scorer rates a generated continuation given its context along three dimensions: relevance (R), interestingness (I), and overall quality (O).…”
Section: Discussion
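The standard metrics named in this statement are available in common libraries; a minimal sketch using nltk for BLEU and the rouge-score package for ROUGE-L (the library and smoothing choices are illustrative assumptions, not necessarily what the citing paper used):

    # Sketch of the standard automatic metrics named above: sentence-level
    # BLEU via nltk and ROUGE-L via the rouge-score package.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from rouge_score import rouge_scorer

    reference = "he soon lost his way among the trees"
    hypothesis = "he quickly lost his path in the trees"

    bleu = sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=SmoothingFunction().method1)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, hypothesis)["rougeL"].fmeasure
    print(f"BLEU: {bleu:.3f}  ROUGE-L F1: {rouge_l:.3f}")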
“…Though all have strengths and weaknesses, ROUGE metrics (particularly ROUGE-L) are common for multi-sentence text evaluations. Textual metrics that consider specific qualities in the system outputs, like complexity and diversity, are also used to evaluate NLG systems (Dusek et al., 2019; Hashimoto et al., 2019; Sagarkar et al., 2018; Purdy et al., 2018). Word mover's distance has recently been used for NLP tasks like learning word embeddings (Zhang et al., 2017; Wu et al., 2018), textual entailment (Sulea, 2017), document similarity and classification (Kusner et al., 2015; Huang et al., 2016; Atasu et al., 2017), image captioning (Kilickaya et al., 2017), document retrieval (Balikas et al., 2018), clustering for semantic word-rank (Zhang and Wang, 2018), and as an additional loss for text generation that measures the optimal transport between the generated hypothesis and the reference text (Chen et al., 2019).…”
Section: Related Work
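Word mover's distance, discussed in this statement, is exposed by gensim; a minimal sketch over pretrained GloVe vectors (the embedding choice is an illustrative assumption, and gensim's implementation additionally requires the POT optimal-transport package):

    # Sketch of word mover's distance between a generated hypothesis and a
    # reference, via gensim's wmdistance over pretrained GloVe vectors.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")  # downloads the vectors on first use
    reference = "he soon lost his way among the trees".split()
    hypothesis = "he quickly lost his path in the woods".split()
    print(vectors.wmdistance(reference, hypothesis))  # lower distance = more similar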