2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.497

Jointly Modeling Embedding and Translation to Bridge Video and Language

Abstract: Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate each word locally, given the previous words and the visual content, so the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the seman…


Cited by 498 publications (380 citation statements). References 36 publications (63 reference statements).
“…contrast to previous works [9,15], the concatenation of the two features does not yield any improvement. Hence, we rely on the Googlenet-bu4k feature as the visual input to the sentence generation module.…”
Section: Early Embedding (contrasting)
confidence: 46%
“…The pool5 layer after ReLU is used, resulting in a feature vector of 1,024 dimensions. Observing the increasing popularity of 3-D ConvNets (C3D) for video captioning [9,15], we have also experimented with a C3D model trained by Tran et al. on one million sports videos [11]. Though longer (4,096-dim), the C3D feature is inferior to the Googlenet-bu4k feature according to our experiments.…”
Section: Video Representation (mentioning)
confidence: 99%
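
The representation described in this statement is a standard frame-level CNN feature pooled over time. Below is a minimal sketch of that idea, assuming torchvision's ImageNet-pretrained GoogLeNet as a stand-in for the fine-tuned Googlenet-bu4k model of the citing work (which is not specified here); the video_feature helper and the 16-frame sampling are illustrative assumptions.

    import torch
    import torchvision

    # Per-frame 2-D CNN features mean-pooled over time; GoogLeNet's pooled
    # 1,024-dim output plays the role of the pool5-after-ReLU feature above.
    cnn = torchvision.models.googlenet(weights="IMAGENET1K_V1")
    cnn.fc = torch.nn.Identity()   # expose the 1,024-dim pooled feature instead of class logits
    cnn.eval()

    @torch.no_grad()
    def video_feature(frames):
        # frames: (num_frames, 3, 224, 224), already resized and normalized
        per_frame = cnn(frames)        # (num_frames, 1024)
        return per_frame.mean(dim=0)   # single 1,024-dim video descriptor

    clip = torch.randn(16, 3, 224, 224)   # stand-in for 16 sampled frames
    print(video_feature(clip).shape)      # torch.Size([1024])
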
“…Later Venugopalan et al (2015c) extended this work to extract CNN features from frames which are max-pooled over time. Pan et al (2016b) propose a framework that consists of a 2-/3-D CNN and LSTM trained jointly with a visual-semantic embedding to ensure better coherence between video and text. Xu et al (2015b) jointly address the language generation and video/language retrieval tasks by learning a joint embedding for a deep video model and a compositional semantic language model.…”
Section: Video Descriptionmentioning
confidence: 99%
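
The framework of Pan et al. summarized in this statement couples sentence generation with a visual-semantic embedding. The following is a minimal sketch of that kind of joint objective, not the authors' exact formulation: the module names, dimensions, mean-pooled sentence embedding, squared-distance relevance term, and the 0.5 loss weight are all illustrative assumptions.

    import torch
    import torch.nn as nn

    class JointEmbeddingCaptioner(nn.Module):
        def __init__(self, video_dim=1024, embed_dim=512, vocab_size=10000, hidden_dim=512):
            super().__init__()
            self.video_proj = nn.Linear(video_dim, embed_dim)      # video -> shared embedding space
            self.word_embed = nn.Embedding(vocab_size, embed_dim)  # words -> shared embedding space
            self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, video_feat, captions):
            # Relevance: pull the video embedding toward a crude sentence embedding
            # (mean of its word embeddings) in the shared space.
            v = self.video_proj(video_feat)                  # (B, embed_dim)
            words = self.word_embed(captions)                # (B, T, embed_dim)
            s = words.mean(dim=1)                            # (B, embed_dim)
            relevance = ((v - s) ** 2).sum(dim=1).mean()

            # Coherence: next-word prediction with the LSTM, conditioning on the
            # video by prepending its embedding as the first input step.
            inputs = torch.cat([v.unsqueeze(1), words[:, :-1]], dim=1)   # (B, T, embed_dim)
            hidden, _ = self.decoder(inputs)
            logits = self.out(hidden)                        # (B, T, vocab_size)
            coherence = nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
            return relevance, coherence

    model = JointEmbeddingCaptioner()
    video = torch.randn(4, 1024)                 # e.g. mean-pooled 2-D/3-D CNN features
    caps = torch.randint(0, 10000, (4, 12))      # token ids of the paired sentences
    loss_rel, loss_coh = model(video, caps)
    loss = loss_coh + 0.5 * loss_rel             # weighted sum; the trade-off weight is an assumption
    loss.backward()

Jointly minimizing both terms is what ties the generated sentence to the video globally, rather than only word by word, which is the coherence issue raised in the abstract above.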