Proceedings of the 24th ACM International Conference on Multimedia 2016
DOI: 10.1145/2964284.2984064
Early Embedding and Late Reranking for Video Captioning

Abstract: This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to LSTM by tag embeddings. The other is late reranking, for re-scoring generated sentences in terms of their relevance to a specific video. The modules are inspired by recent works on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set s…
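The abstract describes the two modules only at a high level. The following is a minimal, hypothetical PyTorch-style sketch of what they could look like: fusing a video tag embedding into the LSTM input at every step (early embedding) and re-scoring beam-search candidates by combining language-model likelihood with a sentence-video relevance score (late reranking). The class names, dimensions, concatenation-based fusion, and the linear interpolation weight `alpha` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class EarlyEmbeddingDecoder(nn.Module):
    """LSTM caption decoder whose per-step input is enriched with a video tag embedding.
    A sketch under assumed dimensions; the fusion by concatenation is an assumption."""
    def __init__(self, vocab_size, word_dim=300, tag_dim=300, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.tag_proj = nn.Linear(tag_dim, tag_dim)        # projects the aggregated tag embedding
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)   # ConvNet feature -> initial LSTM state
        self.lstm = nn.LSTM(word_dim + tag_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feat, tag_emb, captions):
        # video_feat: (B, feat_dim) pooled ConvNet feature of the video
        # tag_emb:    (B, tag_dim) embedding of predicted video tags
        # captions:   (B, T) token ids of the target sentence
        h0 = torch.tanh(self.feat_proj(video_feat)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        words = self.word_emb(captions)                                      # (B, T, word_dim)
        tags = self.tag_proj(tag_emb).unsqueeze(1).expand(-1, words.size(1), -1)
        x = torch.cat([words, tags], dim=-1)                                 # early embedding at the input
        out, _ = self.lstm(x, (h0, c0))
        return self.out(out)                                                 # (B, T, vocab_size) logits

def late_rerank(candidates, lm_scores, relevance_scores, alpha=0.5):
    """Late reranking sketch: combine the decoder's sentence score with a
    sentence-video relevance score (e.g. similarity in a joint embedding space)."""
    combined = [(1 - alpha) * l + alpha * r for l, r in zip(lm_scores, relevance_scores)]
    best = max(range(len(candidates)), key=lambda i: combined[i])
    return candidates[best]
```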

Cited by 65 publications (14 citation statements)
References 14 publications (19 reference statements)
“…We conduct experiments on the MSR-VTT dataset [51], which is a recently released large-scale video caption benchmark. This dataset contains 10,000 video clips (6,513 for training, 497 for validation and 2,990 for testing) from 20 categories, including news, sports, etc. Each video clip is manually annotated with 20 natural sentences.…”
Section: Dataset and Implementation Details (mentioning)
confidence: 99%
“…Our baseline approach (the 2nd last row) is significantly better than these 3 methods. We also compare with the top-4 results from the MSR-VTT challenge in the table, including v2t navigator [15], Aalto [40], VideoLAB [34] and ruc uva [6], which are all based on features from multiple cues such as action features like C3D and audio features like Bag-of-Audio-Words (BoAW) [31]. Our baseline has on-par accuracy to the state-of-the-art methods.…”
Section: Ablation Studies On Single Sentence Captioning (mentioning)
confidence: 99%
“…MSR-VTT [57] is a recently released dataset. We compare performance of our approach on this dataset with the latest published models such as Alto [42], RUC-UVA [15], TDDF [61], PickNet [13], M³-VC [54] and RecNet local [52]. The results are summarized in Table 4.…”
Section: Results On MSR-VTT Dataset (mentioning)
confidence: 99%
“…We compare with two groups of baseline methods: 1) fundamental methods including S2VT [46] which shares an LSTM structure in both encoding and decoding phases, Mean-Pooling LSTM (MP-LSTM) [47] which performs a mean-pooling for all sampled visual frames as the input for an LSTM decoder and Soft-Attention LSTM (SA-LSTM) [61] which employs an attention model to summarize visual features for decoding each word; 2) newly published state-of-the-art methods including RecNet [51] which refines the captioning by reconstructing the visual features from decoding hidden states, VideoLAB [34] which proposes to fuse source information of multiple modalities to improve the performance, PickNet [6] that picks the informative frames based on a reinforcement learning framework, Aalto [37] that designs an evaluator model to pick the best caption from multiple candidate captions, and rucuva [10] which proposes to incorporate tag embeddings in encoding while designing a specific model to re-rank the candidate captions.…”
Section: Comparison On MSR-VTT (mentioning)
confidence: 99%