Multimodal Video Description

Ramanishka, Vasili; Das, Abir; Park, Dong Huk; Venugopalan, Subhashini; Hendricks, Lisa Anne; Rohrbach, Marcus; Saenko, Kate

doi:10.1145/2964284.2984066

Cited by 133 publications

(67 citation statements)

References 13 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Several works have employed multimodal signals to caption the MSR-VTT dataset (Xu et al, 2016), which consists of 2K video clips from 20 general categories (e.g., "news", "sports") with an average duration of 10 seconds per clip. In particular, Ramanishka et al (2016) However -we suspect that instructional video domain is significantly different than MSR-VTT (where the audio information does not necessarily correspond to human speech), as we find that ASR-only models significantly surpass the state-of-the-art video model in our case. Palaskar et al (2019) and Shi et al (2019), contemporaneous with the submission of the present work, also examine ASR as a source of signal for generating how-to video captions.…”

Section: Related Workmentioning

confidence: 61%

A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

Hessel¹,

Pang²,

Zhu³

et al. 2019

Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

View full text Add to dashboard Cite

Instructional videos get high-traffic on video sharing platforms, and prior work suggests that providing time-stamped, subtask annotations (e.g., "heat the oil in the pan") improves user experiences. However, current automatic annotation methods based on visual features alone perform only slightly better than constant prediction. Taking cues from prior work, we show that we can improve performance significantly by considering automatic speech recognition (ASR) tokens as input. Furthermore, jointly modeling ASR tokens and visual features results in higher performance compared to training individually on either modality. We find that unstated background information is better explained by visual features, whereas fine-grained distinctions (e.g., "add oil" vs. "add olive oil") are disambiguated more easily via ASR tokens.

show abstract

Section: Related Workmentioning

confidence: 61%

A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

Hessel¹,

Pang²,

Zhu³

et al. 2019

Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

View full text Add to dashboard Cite

show abstract

“…More recently, different features can help characterizing the video semantic meaning from different perspectives. Many existing works utilize the motion information [42], temporal information [4,18,31], and even the audio information [51] to yield competitive performance. However, the diverse features in these works are simply concatenated with each other, which ignores the relationship among them.…”

Section: Video Captioningmentioning

confidence: 99%

“…In this subsection, we compare our method with the state-of-the-art methods with multiple features on benchmark datasets, including SA [53], M3 [47], v2t navigator [18], Aalto [36], VideoLab [31], MA-LSTM [51], M&M-TGM [4], PickNet [8], LSTM-TSA IV [28], SibNet [23], MGSA [5], and SCN-LSTM [14], most of which fuse different features by simply concatenating.…”

Section: Performance Comparisonsmentioning

confidence: 99%

Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network

Wang

Zhang

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

157

View full text Add to dashboard Cite

In this paper, we propose to guide the video caption generation with Part-of-Speech (POS) information, based on a gated fusion of multiple representations of input videos. We construct a novel gated fusion network, with one particularly designed cross-gating (CG) block, to effectively encode and fuse different types of representations, e.g., the motion and content features of an input video. One POS sequence generator relies on this fused representation to predict the global syntactic structure, which is thereafter leveraged to guide the video captioning generation and control the syntax of the generated sentence. Specifically, a gating strategy is proposed to dynamically and adaptively incorporate the global syntactic POS information into the decoder for generating each word. Experimental results on two benchmark datasets, namely MSR-VTT and MSVD, demonstrate that the proposed model can well exploit complementary information from multiple representations, resulting in improved performances. Moreover, the generated global POS information can well capture the global syntactic structure of the sentence, and thus be exploited to control the syntactic structure of the description. Such POS information not only boosts the video captioning performance but also improves the diversity of the generated captions. Our code is at: https://github.com/vsislab/ Controllable_XGating. * This work was done while Bairui Wang was a Research Intern with Tencent AI Lab. † Corresponding authors.

show abstract

“…This indicates that it is beneficial to train our model using step-by-step learning. For MSR-VTT, we also compare our models with the top-3 results from the MSR-VTT challenge in the table 1 , including v2t-navigator (Jin et al 2016), Aalto (Shetty and Laaksonen 2016) and VideoLAB (Ramanishka et al 2016), which are all based on features from multiple cues such as action features and audio features. The experimental results presented in Table 1 show that our TDAM performs significantly better than other methods on all metrics.…”

Section: Comparison With the State-of-the-artmentioning

confidence: 99%

Video captioning with text-based dynamic attention and step-by-step learning

Xiao

Shi

2020

Pattern Recognition Letters

View full text Add to dashboard Cite

Automatically describing video content with natural language has been attracting much attention in CV and NLP communities. Most existing methods predict one word at a time, and by feeding the last generated word back as input at the next time, while the other generated words are not fully exploited. Furthermore, traditional methods optimize the model using all the training samples in each epoch without considering their learning situations, which leads to a lot of unnecessary training and can not target the difficult samples. To address these issues, we propose a text-based dynamic attention model named TDAM, which imposes a dynamic attention mechanism on all the generated words with the motivation to improve the context semantic information and enhance the overall control of the whole sentence. Moreover, the text-based dynamic attention mechanism and the visual attention mechanism are linked together to focus on the important words. They can benefit from each other during training. Accordingly, the model is trained through two steps: "starting from scratch" and "checking for gaps". The former uses all the samples to optimize the model, while the latter only trains for samples with poor control. Experimental results on the popular datasets MSVD and MSR-VTT demonstrate that our non-ensemble model outperforms the state-ofthe-art video captioning benchmarks.

show abstract

Multimodal Video Description

Cited by 133 publications

References 13 publications

A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network

Video captioning with text-based dynamic attention and step-by-step learning

Contact Info

Product

Resources

About