Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.525
STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation

Abstract: Systems for story generation are asked to produce plausible and enjoyable stories given an input context. This task is underspecified, as a vast number of diverse stories can originate from a single input. The large output space makes it difficult to build and evaluate story generation models, as (1) existing datasets lack rich enough contexts to meaningfully guide models, and (2) existing evaluations (both crowdsourced and automatic) are unreliable for assessing long-form creative text. To address these issues…

Cited by 54 publications (51 citation statements). References 28 publications.
“…Recent work has also found conducting human evaluation for long-form generation to be challenging, for example in the context of question answering and story generation (Akoury et al., 2020). Our observations for data-to-text generation complement theirs and we hope that our dataset can inspire future research on human evaluation for long-form text generation.…”
Section: Methods (supporting)
confidence: 65%
“…For example, TuringAdvice (Zellers et al., 2021) asks evaluators to rate NLG models by their ability to generate helpful advice, and RoFT (Dugan et al., 2020) engages evaluators through a guessing game to determine the boundary between human- and machine-generated text. Other evaluation methods ask the evaluators to directly interact with the generated text; for example, Choose Your Own Adventure (Clark and Smith, 2021) and Storium (Akoury et al., 2020) evaluate story generation models by having people write stories with the help of generated text. We see that GPT3 can successfully mimic human-authored text across several domains, renewing the importance of evaluations that push beyond surface-level notions of quality and consider whether a text is helpful in a downstream setting or has attributes that people would want from machine-generated text.…”
Section: Recommendations (mentioning)
confidence: 99%
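The machine-in-the-loop evaluations cited above score a generation system by how authors actually use its suggestions while writing, rather than by asking raters for isolated quality judgments. As a rough illustration only (not the metric used by STORIUM or Choose Your Own Adventure), the sketch below computes the fraction of machine-generated tokens an author keeps when editing a suggestion into a published story entry; the function name, whitespace tokenization, and use of difflib.SequenceMatcher are all assumptions made for this example.

from difflib import SequenceMatcher

def edit_retention(generated: str, published: str) -> float:
    # Fraction of generated tokens that survive, in order, into the
    # author-edited published text (illustrative; whitespace tokenization).
    gen_tokens = generated.split()
    pub_tokens = published.split()
    if not gen_tokens:
        return 0.0
    matcher = SequenceMatcher(a=gen_tokens, b=pub_tokens, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(gen_tokens)

# Example: the author keeps part of the suggestion and rewrites the rest.
generated = "The knight crept into the ruined tower and lit a torch"
published = "The knight crept into the ruined tower and waited in the dark"
print(round(edit_retention(generated, published), 2))

Because a platform like STORIUM records the author's published edit alongside each model suggestion, retention-style signals of this kind can be computed automatically; the exact formulation above is only a stand-in for whatever edit-based metric a given platform defines.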
“…In addition to GENIE, multiple other related efforts exist that work toward the goal of reproducible and robust in-depth human and automatic evaluation for NLG tasks, and which focus on specific modeling- or task-aspects that are different from those in GEM. Among those are KILT (Petroni et al., 2020), which focuses on knowledge-intensive tasks and retrieval-based models, Storium (Akoury et al., 2020), which focuses on open-ended story generation, and BIG bench, which focuses on measuring few-shot and zero-shot capabilities of language models.…”
Section: Increasing Multilingualism of NLG Research (mentioning)
confidence: 99%