2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.159

Attentive Semantic Video Generation Using Captions

Abstract: This paper proposes a network architecture to perform variable-length semantic video generation using captions. We adopt a new perspective towards video generation where we allow the captions to be combined with the long-term and short-term dependencies between video frames and thus generate a video in an incremental manner. Our experiments demonstrate our network architecture's ability to distinguish between objects, actions and interactions in a video and combine them to generate videos for unseen captions.
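As a rough illustration of the incremental, caption-conditioned generation the abstract describes, the sketch below keeps two recurrent states (a per-frame short-term state and a slower video-level state), attends over caption word embeddings at each step, and emits one frame at a time for an arbitrary number of steps. This is a minimal interpretation, not the paper's actual architecture; all module names, dimensions, and the attention scheme are hypothetical.

```python
# Minimal sketch (not the authors' exact architecture): a caption-conditioned
# recurrent generator that emits one frame per step, maintaining separate
# short-term (per-frame) and long-term (whole-video) recurrent states.
import torch
import torch.nn as nn


class CaptionConditionedGenerator(nn.Module):
    def __init__(self, caption_dim=128, hidden_dim=256, frame_shape=(1, 64, 64)):
        super().__init__()
        c, h, w = frame_shape
        self.frame_enc = nn.Linear(c * h * w, hidden_dim)        # encode previous frame
        self.short_term = nn.LSTMCell(hidden_dim + caption_dim, hidden_dim)
        self.long_term = nn.LSTMCell(hidden_dim, hidden_dim)
        # soft attention over caption word embeddings, queried by the short-term state
        self.attn = nn.Linear(hidden_dim + caption_dim, 1)
        self.decode = nn.Linear(2 * hidden_dim, c * h * w)       # next-frame pixels
        self.frame_shape = frame_shape

    def attend(self, word_embs, query):
        # word_embs: (B, T_words, caption_dim), query: (B, hidden_dim)
        q = query.unsqueeze(1).expand(-1, word_embs.size(1), -1)
        scores = self.attn(torch.cat([q, word_embs], dim=-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)
        return (weights.unsqueeze(-1) * word_embs).sum(dim=1)    # (B, caption_dim)

    def forward(self, word_embs, num_frames):
        b = word_embs.size(0)
        c, h, w = self.frame_shape
        device = word_embs.device
        hs = cs = torch.zeros(b, self.short_term.hidden_size, device=device)
        hl = cl = torch.zeros(b, self.long_term.hidden_size, device=device)
        prev = torch.zeros(b, c * h * w, device=device)
        frames = []
        for _ in range(num_frames):                              # variable-length generation
            ctx = self.attend(word_embs, hs)                     # caption attention
            hs, cs = self.short_term(torch.cat([self.frame_enc(prev), ctx], dim=-1), (hs, cs))
            hl, cl = self.long_term(hs, (hl, cl))                # slower, video-level state
            prev = torch.sigmoid(self.decode(torch.cat([hs, hl], dim=-1)))
            frames.append(prev.view(b, c, h, w))
        return torch.stack(frames, dim=1)                        # (B, num_frames, C, H, W)
```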

Cited by 55 publications (33 citation statements) · References 17 publications (29 reference statements)

Citation statements (ordered by relevance):
“…Mittal et al [23] first introduced this task and proposed a VAE-based framework, called Sync-DRAW, to encode simple captions and generate semantically consistent videos. A concurrent work [21] performs variable-length semantic video generation from captions. The model relies on VAE and recurrent neural networks (RNNs) to learn the long-term and short-term context of the video.…”
Section: Text-to-video Generation
Mentioning confidence: 99%
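To make the VAE-plus-RNN idea in the excerpt above concrete, here is a minimal single-step sketch under stated assumptions: a recurrent state summarizes the frames generated so far, a Gaussian latent is sampled via the reparameterization trick, and the decoder produces the next frame conditioned on the latent, the recurrent state, and the caption embedding. Module names and dimensions are illustrative, not the cited models' actual layers.

```python
# Hypothetical single generation step of a recurrent VAE conditioned on a caption.
import torch
import torch.nn as nn


class RecurrentVAEStep(nn.Module):
    def __init__(self, frame_dim=4096, caption_dim=128, hidden_dim=256, latent_dim=64):
        super().__init__()
        self.rnn = nn.GRUCell(frame_dim, hidden_dim)             # temporal context over frames
        self.to_mu = nn.Linear(hidden_dim + caption_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim + caption_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim + hidden_dim + caption_dim, frame_dim)

    def forward(self, prev_frame, hidden, caption):
        hidden = self.rnn(prev_frame, hidden)                    # update short-term context
        cond = torch.cat([hidden, caption], dim=-1)
        mu, logvar = self.to_mu(cond), self.to_logvar(cond)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        frame = torch.sigmoid(self.decoder(torch.cat([z, hidden, caption], dim=-1)))
        # KL term of the VAE objective against a standard normal prior
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return frame, hidden, kl
```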
“…Prior work has explored supplementing relevant visual content to text stories based on the similarity of the story text to the annotation of images or video clips [16,17,44]. Similarly, built on large training datasets of visual content with text annotations, research has explored the generation of images [13], 3D scenes [2], and video sequences [22] from text with neural nets. Leake et al explored leveraging word concreteness, which measures how closely a word is related to some perceptible concepts, to automatically create slideshows from text input by composing images of the most concrete words [19].…”
Section: Automatic Generation Of Visual Content From Text
Mentioning confidence: 99%
“…Several more studies [13], [25], [15] adopted the convolutional LSTM (ConvLSTM) to take spatial and temporal context into account. [26] proposed a video generation framework that utilized the ConvLSTM to encode short-term and long-term spatio-temporal context for semantic video generation using captions. [27] incorporated the ConvLSTM into a deep generative model that modeled the factorization of the joint likelihood of video inputs.…”
Section: Related Work
Mentioning confidence: 99%
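ConvLSTM is not a built-in PyTorch module; the sketch below is a standard minimal cell of the kind referenced in the excerpt above, in which the LSTM gates are computed with convolutions so the hidden and cell states keep a spatial layout, which is what makes it suitable for encoding spatio-temporal context of video frames. Channel counts and the unrolling loop are illustrative assumptions.

```python
# Minimal ConvLSTM cell: gates are convolutions, states are feature maps.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # one convolution produces all four gates (input, forget, output, candidate)
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)
        self.hidden_channels = hidden_channels

    def forward(self, x, state):
        h, c = state                                   # hidden and cell maps: (B, C_h, H, W)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                  # update spatial cell state
        h = o * torch.tanh(c)
        return h, c


# Usage: unroll the cell over the frames of a clip to accumulate temporal context.
cell = ConvLSTMCell(in_channels=3, hidden_channels=16)
frames = torch.randn(2, 5, 3, 64, 64)                  # (batch, time, channels, H, W)
h = torch.zeros(2, 16, 64, 64)
c = torch.zeros(2, 16, 64, 64)
for t in range(frames.size(1)):
    h, c = cell(frames[:, t], (h, c))
```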