2018
DOI: 10.1007/978-3-030-01237-3_37

Imagine This! Scripts to Compositions to Videos

Abstract: Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge. Towards this goal, we present the Composition, Retrieval and Fusion Network (Craft), a model capable of learning this knowledge from video-caption data and applying it while generating videos from novel captions. Craft explicitly predicts a temporal-layout of mentioned entities (characters and objects), retrieves spatio-temporal entity segments …

Cited by 51 publications (37 citation statements)
References 26 publications
“…For instance, story images have been retrieved from a pre-collected training set rather than generated [26]. Cartoon generation has been explored with a "cut and paste" technique [11]. However, both of these techniques require large amounts of labeled training data.…”
Section: Related Work
confidence: 99%
“…Kim et al [16] performed pictorial generation from chat logs, while our work uses text, which is considerably more underspecified. Gupta et al [9] proposed a semiparametric method to generate cartoon-like pictures. However, the presented objects were also provided as inputs to the model, and the predictions of layouts, foregrounds and backgrounds were performed by separately trained modules.…”
Section: Related Work
confidence: 99%
“…While scene layout generation in this work predicts probability distributions for bounding box layout, it fails to model the stochasticity intrinsic to predicting each bounding box. Gupta et al [8] use an approach similar to [11] to predict layouts for generating videos from scripts. Johnson et al [12] use the scene graph generated from the input sentence as input to the image generation model.…”
Section: Related Work
confidence: 99%