2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2018.00794
Show Me a Story: Towards Coherent Neural Story Illustration

Cited by 29 publications (28 citation statements)
References 21 publications
“…• CNSI [12]: a global visual-semantic matching model that uses a hand-crafted coherence feature in its encoder. • No Context [11]: the state-of-the-art dense visual-semantic matching model for text-to-image retrieval.…”
Section: Quantitative Results
confidence: 99%
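The quoted comparison distinguishes global matching (one embedding per story and per image) from dense matching (per-word similarities against bottom-up region features). A minimal sketch of the two scoring schemes, with random vectors standing in for real embeddings (the function names, dimensions, and aggregation by per-word maxima are illustrative assumptions, not the cited models' exact formulation):

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def global_score(text_vec, image_vec):
    # Global matching: a single cosine similarity between one story embedding
    # and one image embedding.
    return float(l2norm(text_vec) @ l2norm(image_vec))

def dense_score(word_vecs, region_vecs):
    # Dense matching: each word is matched to its best region; the image score
    # averages those per-word maxima.
    sims = l2norm(word_vecs) @ l2norm(region_vecs).T   # (n_words, n_regions)
    return float(sims.max(axis=1).mean())

rng = np.random.default_rng(0)
words = rng.standard_normal((6, 64))      # hypothetical word embeddings
regions = rng.standard_normal((36, 64))   # hypothetical bottom-up region features
candidates = [rng.standard_normal((36, 64)) for _ in range(5)]
best = max(range(5), key=lambda i: dense_score(words, candidates[i]))  # retrieved image index
```

Ranking candidate images by `dense_score` is what makes the retrieval "dense": a single off-topic region does not drag down the score the way it can in a pooled global embedding.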
“…Table 1 presents the story-to-image retrieval performance of the four models on the VIST test set. The "No Context" model achieved significant improvements over the previous CNSI [12] method, mainly attributed to dense visual-semantic matching with bottom-up region features instead of global matching. The CADM model without attention boosts the performance of the "No Context" model with fixed context, which demonstrates the importance of contextual information for story-to-image retrieval.…”
Section: Quantitative Results
confidence: 99%
“…Despite being unsuitable for our inverse problem, VIST has also been used for retrieving images when given text, in work related to ours. In an approach called Coherent Neural Story Illustration (CNSI), an encoder-decoder network [27] was built to first encode sentences using a hierarchical two-level sentence-story gated recurrent unit (GRU), and then sequentially decode into a corresponding sequence of illustrative images. A previously proposed coherence model [24] was used to explicitly model co-references between sentences.…”
Section: Story
confidence: 99%
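The hierarchical two-level sentence-story GRU encoder described above can be sketched as follows. This is an illustrative, untrained numpy implementation under assumed dimensions; the class name, weight initialization, and toy inputs are hypothetical, not CNSI's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUEncoder:
    """Minimal GRU cell; weights are random since this is an untrained sketch."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        def w(rows, cols):
            return 0.1 * rng.standard_normal((rows, cols))
        self.Wz, self.Uz = w(hidden_dim, input_dim), w(hidden_dim, hidden_dim)
        self.Wr, self.Ur = w(hidden_dim, input_dim), w(hidden_dim, hidden_dim)
        self.Wh, self.Uh = w(hidden_dim, input_dim), w(hidden_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def encode(self, seq):
        """Run the GRU over a sequence of vectors; return the final hidden state."""
        h = np.zeros(self.hidden_dim)
        for x in seq:
            z = sigmoid(self.Wz @ x + self.Uz @ h)        # update gate
            r = sigmoid(self.Wr @ x + self.Ur @ h)        # reset gate
            h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h))
            h = (1.0 - z) * h + z * h_cand
        return h

# Two-level hierarchy: a word-level GRU encodes each sentence into one vector,
# then a story-level GRU encodes the sequence of sentence vectors.
word_gru = GRUEncoder(input_dim=16, hidden_dim=32)
story_gru = GRUEncoder(input_dim=32, hidden_dim=32, seed=1)

rng = np.random.default_rng(2)
story = [rng.standard_normal((5, 16)) for _ in range(4)]   # 4 sentences, 5 "word vectors" each
sentence_vecs = [word_gru.encode(sent) for sent in story]
story_state = story_gru.encode(sentence_vecs)              # one vector summarizing the story
```

In the full CNSI pipeline, a decoder would condition on such story-level states to select one illustrative image per sentence; only the two-level encoding is sketched here.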
“…With the rapid growth of multimedia data [23,30], understanding visual content and interpreting it in natural language have been important yet challenging tasks that could benefit a wide range of real-world applications, such as storytelling [16,36,45], poetry creation [26,27,50,51] and support of the disabled. While deep learning techniques have made remarkable progress in describing visual content via image captioning [10,25,39,55], the obtained results are generally sentence-level, with fewer than twenty words.…”
Section: Introduction
confidence: 99%