Plot-guided Adversarial Example Construction for Evaluating Open-domain Story Generation

Ghazarian, Sarik; Liu, Zixi; Akash, S M; Weischedel, Ralph; Galstyan, Aram; Peng, Nanyun

doi:10.18653/v1/2021.naacl-main.343

Cited by 9 publications

(12 citation statements)

References 22 publications

“…Therefore our model trained on human-written stories can hardly evaluate story coherence. To enable our model to evaluate story considering coherence issues, we further train our model (Ours (N)) with negative stories that are generated by the methods in the previous works (Guan and Huang, 2020; Ghazarian et al, 2021). We change the margin ranking loss as follows:…”

Section: Task 1: Preference Score Prediction (Ranking) | mentioning

confidence: 99%

“…COH 200. We use the same human collected data in the previous work (Ghazarian et al, 2021), which focused on recognizing coherence issues in the machine-generated stories (e.g., repeat plots, conflict logic).…”

Section: Correlation With Human Judgments | mentioning

confidence: 99%

“…To show the generalization of evaluation metrics, we calculate the averaged predicted preference scores for data from different domains (see Table 7). We compute average scores on 1) lowly-voted (low) and highly-voted stories (high) on both WP 200 and SCARY 200, 2) machine-generated stories by LED (LED), and with Plan-and-Write strategy (Yao et al, 2019) (P&W) trained separately on the highly-upvoted and lowly-upvoted stories, 3) negative stories generated from previous works (Guan and Huang, 2020; Ghazarian et al, 2021), 4) stories from other datasets: fairy tales (short stories), childbook dataset (Hill et al, 2015) and bookcorpus (Zhu et al, 2015).…”

Section: Domain Transfer In Preference Score | mentioning

confidence: 99%

StoryER: Automatic Story Evaluation via Ranking, Rating and Reasoning

Chen, Vo, Takamura et al. 2022

Preprint

Existing automatic story evaluation methods place a premium on story lexical level coherence, deviating from human preference. We go beyond this limitation by considering a novel Story Evaluation method that mimics human preference when judging a story, namely StoryER, which consists of three subtasks: Ranking, Rating and Reasoning. Given either a machine-generated or a human-written story, StoryER requires the machine to output 1) a preference score that corresponds to human preference, 2) specific ratings and their corresponding confidences and 3) comments for various aspects (e.g., opening, character shaping). To support these tasks, we introduce a well-annotated dataset comprising (i) 100k ranked story pairs; and (ii) a set of 46k ratings and comments on various aspects of the story. We finetune Longformer-Encoder-Decoder (LED) on the collected dataset, with the encoder responsible for preference score and aspect prediction and the decoder for comment generation. Our comprehensive experiments result in a competitive benchmark for each task, showing the high correlation to human preference. In addition, we have witnessed the joint learning of the preference scores, the aspect ratings, and the comments brings gain in each single task. Our dataset and benchmarks are publicly available to advance the research of story evaluation tasks.

“…Synthetic datasets Synthetic dataset construction has been shown to improve robustness of evaluation models (Gupta et al, 2021; Ghazarian et al, 2021) and improve the complexity of test sets (Sakaguchi et al, 2021; Feng et al, 2021). Synthetic claims have been explored in fact-checking to create adversarial and hard test sets.…”

Section: Consistency In Dialogue | mentioning

confidence: 99%

DialFact: A Benchmark for Fact-Checking in Dialogue

Gupta, Wu, Liu et al. 2021

Preprint

Fact-checking is an essential tool to mitigate the spread of misinformation and disinformation, however, it has been often explored to verify formal single-sentence claims instead of casual conversational claims. To study the problem, we introduce the task of fact-checking in dialogue. We construct DIALFACT, a testing benchmark dataset of 22,245 annotated conversational claims, paired with pieces of evidence from Wikipedia. There are three sub-tasks in DIALFACT: 1) Verifiable claim detection task distinguishes whether a response carries verifiable factual information; 2) Evidence retrieval task retrieves the most relevant Wikipedia snippets as evidence; 3) Claim verification task predicts a dialogue response to be supported, refuted, or not enough information. We found that existing fact-checking models trained on non-dialogue data like FEVER (Thorne et al., 2018) fail to perform well on our task, and thus, we propose a simple yet data-efficient solution to effectively improve fact-checking performance in dialogue. We point out unique challenges in DIALFACT such as handling the colloquialisms, coreferences and retrieval ambiguities in the error analysis to shed light on future research in this direction.

“…Furthermore, detection of sentences with event boundaries can also be useful when generating engaging stories with a good amount of surprises (Yao et al, 2019; Rashkin et al, 2020; Ghazarian et al, 2021).…”

Section: Introduction | mentioning

confidence: 99%

Uncovering Surprising Event Boundaries in Narratives

Wang, Jafarpour, Sap 2022

Proceedings of the 4th Workshop of Narrative Understanding (WNU2022)

When reading stories, people can naturally identify sentences in which a new event starts, i.e., event boundaries, using their knowledge of how events typically unfold, but a computational model to detect event boundaries is not yet available. We characterize and detect sentences with expected or surprising event boundaries in an annotated corpus of short diary-like stories, using a model that combines commonsense knowledge and narrative flow features with a RoBERTa classifier. Our results show that, while commonsense and narrative features can help improve performance overall, detecting event boundaries that are more subjective remains challenging for our model. We also find that sentences marking surprising event boundaries are less likely to be causally related to the preceding sentence, but are more likely to express emotional reactions of story characters, compared to sentences with no event boundary.
