Proceedings of the 24th ACM International Conference on Multimedia 2016
DOI: 10.1145/2964284.2984064
Early Embedding and Late Reranking for Video Captioning

Abstract: This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to LSTM by tag embeddings. The other is late reranking, for re-scoring generated sentences in terms of their relevance to a specific video. The modules are inspired by recent works on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set s…
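The abstract describes the two modules only at a high level. The following is a minimal, hypothetical PyTorch-style sketch of what they could look like: fusing a video tag embedding into the LSTM input at every step (early embedding) and re-scoring beam-search candidates by combining language-model likelihood with a sentence-video relevance score (late reranking). The class names, dimensions, concatenation-based fusion, and the linear interpolation weight `alpha` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class EarlyEmbeddingDecoder(nn.Module):
    """LSTM caption decoder whose per-step input is enriched with a video tag embedding.
    A sketch under assumed dimensions; the fusion by concatenation is an assumption."""
    def __init__(self, vocab_size, word_dim=300, tag_dim=300, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.tag_proj = nn.Linear(tag_dim, tag_dim)        # projects the aggregated tag embedding
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)   # ConvNet feature -> initial LSTM state
        self.lstm = nn.LSTM(word_dim + tag_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feat, tag_emb, captions):
        # video_feat: (B, feat_dim) pooled ConvNet feature of the video
        # tag_emb:    (B, tag_dim) embedding of predicted video tags
        # captions:   (B, T) token ids of the target sentence
        h0 = torch.tanh(self.feat_proj(video_feat)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        words = self.word_emb(captions)                                      # (B, T, word_dim)
        tags = self.tag_proj(tag_emb).unsqueeze(1).expand(-1, words.size(1), -1)
        x = torch.cat([words, tags], dim=-1)                                 # early embedding at the input
        out, _ = self.lstm(x, (h0, c0))
        return self.out(out)                                                 # (B, T, vocab_size) logits

def late_rerank(candidates, lm_scores, relevance_scores, alpha=0.5):
    """Late reranking sketch: combine the decoder's sentence score with a
    sentence-video relevance score (e.g. similarity in a joint embedding space)."""
    combined = [(1 - alpha) * l + alpha * r for l, r in zip(lm_scores, relevance_scores)]
    best = max(range(len(candidates)), key=lambda i: combined[i])
    return candidates[best]
```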

Cited by 65 publications (14 citation statements)
References 14 publications (19 reference statements)
“…We conduct experiments on the MSR-VTT dataset [51], which is a recently released large-scale video caption benchmark. This dataset contains 10,000 video clips (6,513 for training, 497 for validation and 2,990 for testing) from 20 categories, including news, sports, etc. Each video clip is manually annotated with 20 natural sentences.…”
Section: Dataset and Implementation Details (mentioning)
confidence: 99%
“…Our baseline approach (the 2nd last row) is significantly better than these 3 methods. We also compare with the top-4 results from the MSR-VTT challenge in the table, including v2t navigator [15], Aalto [40], VideoLAB [34] and ruc uva [6], which are all based on features from multiple cues such as action features like C3D and audio features like Bag-of-Audio-Words (BoAW) [31]. Our baseline has on-par accuracy to the state-of-the-art methods.…”
Section: Ablation Studies On Single Sentence Captioning (mentioning)
confidence: 99%
“…MSR-VTT [57] is a recently released dataset. We compare performance of our approach on this dataset with the latest published models such as Alto [42], RUC-UVA [15], TDDF [61], PickNet [13], M³-VC [54] and RecNet local [52]. The results are summarized in Table 4.…”
Section: Results On MSR-VTT Dataset (mentioning)
confidence: 99%
“…We compare with two groups of baseline methods: 1) fundamental methods including S2VT [46] which shares an LSTM structure in both encoding and decoding phases, Mean-Pooling LSTM (MP-LSTM) [47] which performs a mean-pooling for all sampled visual frames as the input for an LSTM decoder and Soft-Attention LSTM (SA-LSTM) [61] which employs an attention model to summarize visual features for decoding each word; 2) newly published state-of-the-art methods including RecNet [51] which refines the captioning by reconstructing the visual features from decoding hidden states, VideoLAB [34] which proposes to fuse source information of multiple modalities to improve the performance, PickNet [6] that picks the informative frames based on a reinforcement learning framework, Aalto [37] that designs an evaluator model to pick the best caption from multiple candidate captions, and rucuva [10] which proposes to incorporate tag embeddings in encoding while designing a specific model to re-rank the candidate captions.…”
Section: Comparison On MSR-VTT (mentioning)
confidence: 99%