2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.497

Jointly Modeling Embedding and Translation to Bridge Video and Language

Abstract: Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate each word locally, given the previous words and the visual content, so the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the seman…


Cited by 498 publications (380 citation statements). References 36 publications (63 reference statements).
“…contrast to previous works [9,15], the concatenation of the two features does not yield any improvement. Hence, we rely on the Googlenet-bu4k feature as the visual input to the sentence generation module.…”
Section: Early Embedding (contrasting)
confidence: 46%
“…The pool5 layer after ReLU is used, resulting in a feature vector of 1,024 dimensions. Observing the increasing popularity of 3-D ConvNets (C3D) for video captioning [9,15], we have also experimented with a C3D model trained by Tran et al. on one million sports videos [11]. Though longer (4,096-dim), the C3D feature is inferior to the Googlenet-bu4k feature according to our experiments.…”
Section: Video Representation (mentioning)
confidence: 99%
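
The representation described in this statement is a standard frame-level CNN feature pooled over time. Below is a minimal sketch of that idea, assuming torchvision's ImageNet-pretrained GoogLeNet as a stand-in for the fine-tuned Googlenet-bu4k model of the citing work (which is not specified here); the video_feature helper and the 16-frame sampling are illustrative assumptions.

    import torch
    import torchvision

    # Per-frame 2-D CNN features mean-pooled over time; GoogLeNet's pooled
    # 1,024-dim output plays the role of the pool5-after-ReLU feature above.
    cnn = torchvision.models.googlenet(weights="IMAGENET1K_V1")
    cnn.fc = torch.nn.Identity()   # expose the 1,024-dim pooled feature instead of class logits
    cnn.eval()

    @torch.no_grad()
    def video_feature(frames):
        # frames: (num_frames, 3, 224, 224), already resized and normalized
        per_frame = cnn(frames)        # (num_frames, 1024)
        return per_frame.mean(dim=0)   # single 1,024-dim video descriptor

    clip = torch.randn(16, 3, 224, 224)   # stand-in for 16 sampled frames
    print(video_feature(clip).shape)      # torch.Size([1024])
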
“…Later Venugopalan et al (2015c) extended this work to extract CNN features from frames which are max-pooled over time. Pan et al (2016b) propose a framework that consists of a 2-/3-D CNN and LSTM trained jointly with a visual-semantic embedding to ensure better coherence between video and text. Xu et al (2015b) jointly address the language generation and video/language retrieval tasks by learning a joint embedding for a deep video model and a compositional semantic language model.…”
Section: Video Descriptionmentioning
confidence: 99%
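
The framework of Pan et al. summarized in this statement couples sentence generation with a visual-semantic embedding. The following is a minimal sketch of that kind of joint objective, not the authors' exact formulation: the module names, dimensions, mean-pooled sentence embedding, squared-distance relevance term, and the 0.5 loss weight are all illustrative assumptions.

    import torch
    import torch.nn as nn

    class JointEmbeddingCaptioner(nn.Module):
        def __init__(self, video_dim=1024, embed_dim=512, vocab_size=10000, hidden_dim=512):
            super().__init__()
            self.video_proj = nn.Linear(video_dim, embed_dim)      # video -> shared embedding space
            self.word_embed = nn.Embedding(vocab_size, embed_dim)  # words -> shared embedding space
            self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, video_feat, captions):
            # Relevance: pull the video embedding toward a crude sentence embedding
            # (mean of its word embeddings) in the shared space.
            v = self.video_proj(video_feat)                  # (B, embed_dim)
            words = self.word_embed(captions)                # (B, T, embed_dim)
            s = words.mean(dim=1)                            # (B, embed_dim)
            relevance = ((v - s) ** 2).sum(dim=1).mean()

            # Coherence: next-word prediction with the LSTM, conditioning on the
            # video by prepending its embedding as the first input step.
            inputs = torch.cat([v.unsqueeze(1), words[:, :-1]], dim=1)   # (B, T, embed_dim)
            hidden, _ = self.decoder(inputs)
            logits = self.out(hidden)                        # (B, T, vocab_size)
            coherence = nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
            return relevance, coherence

    model = JointEmbeddingCaptioner()
    video = torch.randn(4, 1024)                 # e.g. mean-pooled 2-D/3-D CNN features
    caps = torch.randint(0, 10000, (4, 12))      # token ids of the paired sentences
    loss_rel, loss_coh = model(video, caps)
    loss = loss_coh + 0.5 * loss_rel             # weighted sum; the trade-off weight is an assumption
    loss.backward()

Jointly minimizing both terms is what ties the generated sentence to the video globally, rather than only word by word, which is the coherence issue raised in the abstract above.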