2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.496

Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks

Abstract: We present an approach that exploits hierarchical Recurrent Neural Networks (RNNs) to tackle the video captioning problem, i.e., generating one or multiple sentences to describe a realistic video. Our hierarchical framework contains a sentence generator and a paragraph generator. The sentence generator produces one simple short sentence that describes a specific short video interval. It exploits both temporal- and spatial-attention mechanisms to selectively focus on visual elements during generation. The paragr…
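The abstract describes a two-level decoder: a sentence generator that emits the words of one short sentence for a video interval, and a paragraph generator that carries context from one sentence to the next. The sketch below is only a rough illustration of that hierarchical structure, not the authors' implementation (which additionally uses temporal and spatial attention over frame features); the layer sizes, the single mean-pooled video feature, the <BOS> token id, and greedy decoding are all assumptions made for the example.

```python
# Minimal sketch (assumed, not the paper's code): hierarchical sentence/paragraph decoding.
# A word-level GRU generates each sentence; a sentence-level GRU updates a paragraph state
# that initializes the next sentence, so later sentences can depend on earlier ones.
import torch
import torch.nn as nn

class HierarchicalCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, video_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Sentence generator: previous word embedding + a video feature at every step.
        self.word_rnn = nn.GRUCell(embed_dim + video_dim, hidden_dim)
        # Paragraph generator: consumes a summary (final hidden state) of each sentence.
        self.sent_rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feat, num_sentences=3, max_words=10):
        batch = video_feat.size(0)
        para_h = video_feat.new_zeros(batch, self.sent_rnn.hidden_size)
        paragraph = []
        for _ in range(num_sentences):
            # Start the sentence generator from the current paragraph state.
            word_h = para_h
            word = torch.zeros(batch, dtype=torch.long)  # <BOS> token id 0 (assumption)
            sentence = []
            for _ in range(max_words):
                inp = torch.cat([self.embed(word), video_feat], dim=1)
                word_h = self.word_rnn(inp, word_h)
                word = self.out(word_h).argmax(dim=1)    # greedy decoding for illustration
                sentence.append(word)
            # Feed the finished sentence's state upward to update the paragraph context.
            para_h = self.sent_rnn(word_h, para_h)
            paragraph.append(torch.stack(sentence, dim=1))
        return paragraph

# Usage: one mean-pooled feature vector per video stands in for the attention pooling.
feats = torch.randn(2, 256)
sentences = HierarchicalCaptioner()(feats)
print([s.shape for s in sentences])  # three tensors of shape (2, 10)
```

The key design point the sketch tries to convey is that inter-sentence dependency is handled by a separate recurrent state above the word level, rather than by one long flat RNN over the whole paragraph.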

Cited by 502 publications (380 citation statements). References 38 publications.
“…Li et al. (2015) experimented with LSTM autoencoders to show the power of hierarchically structured LSTM networks for encoding and decoding long texts. Recent studies have also successfully generated Chinese poetry and Song iambics (Wang et al., 2016) with hierarchical RNNs.…”
Section: Related Work
confidence: 99%
“…We call this process sentence-directed video object co-discovery. It can be viewed as the inverse of video captioning/description (Barbu et al. 2012; Das et al. 2013; Guadarrama et al. 2013; Rohrbach et al. 2014; Venugopalan et al. 2015; Yu et al. 2015, 2016), where object evidence (in the form of detections or other visual features) is first produced by pretrained detectors and sentences are then generated given the object appearance and movement.…”
Section: Fig.
confidence: 99%
“…Ballas et al. (2016) leverage multiple convolutional maps from different CNN layers to improve the visual representation for activity and video description. To model multi-sentence description, Yu et al. (2016a) propose to use two stacked RNNs, where the first models words within a sentence and the second models sentences within a paragraph. Yao et al. (2016) conducted an interesting study on performance upper bounds for both image and video description tasks on available datasets, including the LSMDC dataset.…”
Section: Concurrent and Consequent Work
confidence: 99%