Plot and Rework: Modeling Storylines for Visual Storytelling

Hsu, Chi-Yang; Chu, Yun-Wei; Huang, Ting-Hao; Ku, Lun-Wei

doi:10.18653/v1/2021.findings-acl.390

Cited by 12 publications

(8 citation statements)

References 34 publications

(35 reference statements)

“…To our knowledge, none of these approaches make use of plan-based decoding. Hsu et al (2021) construct a graph representing the image sequence (based on training data and external resources) and identify the highest scoring path as the best storyline encapsulated therein. The storyline can be viewed as a form of planning, however, on the encoder side.…”

Section: Related Work (mentioning)

confidence: 99%

“…KG-Story (Hsu et al, 2020) predicts a set of words representative of the image sequence, enriches them using external knowledge graphs, and generates stories based on the enriched word set. PR-VIST (Hsu et al, 2021) is a state-of-the-art model which constructs a graph representing the relations between elements in the image sequence, identifies the best storyline captured therein, and proceeds to generate a story based on it. The process of constructing the story graph can be viewed as a form of planning.…”

Section: Comparison Systems (mentioning)

confidence: 99%

“…Visual storytelling involves narrating an engaging and logically coherent story based on a sequence of images (see the example in Figure 1). The task lies at the intersection of natural language processing and computer vision and has recently attracted increasing interest from both communities (Hsu et al, 2021; Xu et al, 2021; Hsu et al, 2020; Wang et al, 2020; Huang et al, 2016). Visual storytelling differs from image captioning, which typically focuses on generating descriptive text, e.g., by identifying and depicting objects within an image.…”

Section: Introduction (mentioning)

confidence: 99%

“…Subsequently, a decoder generates a story token by token based on the encoding of the image sequence. Recent work has mainly focused on enhancing the first stage of the generation process e.g., by leveraging external knowledge sources (Hsu et al, 2021; Hsu et al, 2020; Yang et al, 2019). Advanced representations for image sequences have also been explored, such as scene graphs (Hong et al, 2020) and story graphs (Hsu et al, 2021).…”

Section: Introduction (mentioning)

confidence: 99%

“…Recent work has mainly focused on enhancing the first stage of the generation process e.g., by leveraging external knowledge sources (Hsu et al, 2021; Hsu et al, 2020; Yang et al, 2019). Advanced representations for image sequences have also been explored, such as scene graphs (Hong et al, 2020) and story graphs (Hsu et al, 2021). Despite recent progress, these methods struggle to produce meaningful narratives, are prone to hallucination and repetition, often generate vague sentences, and have difficulty identifying salient visual concepts.…”

Section: Introduction (mentioning)

confidence: 99%

Visual Storytelling with Question-Answer Plans

Liu, Lapata, Keller (2023)

Findings of the Association for Computational Linguistics: EMNLP 2023

Visual storytelling aims to generate compelling narratives from image sequences. Existing models often focus on enhancing the representation of the image sequence, e.g., with external knowledge sources or advanced graph structures. Despite recent progress, the stories are often repetitive, illogical, and lacking in detail. To mitigate these issues, we present a novel framework which integrates visual representations with pretrained language models and planning. Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret. It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative. Automatic and human evaluation on the VIST benchmark (Huang et al., 2016) demonstrates that blueprint-based models generate stories that are more coherent, interesting, and natural compared to competitive baselines and state-of-the-art systems.