2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.347

End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering

Abstract: We propose a high-level concept word detector that can be integrated with any video-to-language model. It takes a video as input and generates a list of concept words as useful semantic priors for language generation models. The proposed word detector has two important properties. First, it does not require any external knowledge sources for training. Second, it is trainable in an end-to-end manner jointly with any video-to-language model. To effectively exploit the detected words, we…
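The abstract describes the detector only at a high level. As a rough illustration of the idea (not the paper's actual architecture), the sketch below assumes per-frame CNN features, a single-layer LSTM, and a multi-label classification head over a fixed concept vocabulary; the names ConceptWordDetector, feat_dim, and vocab_size are hypothetical.

import torch
import torch.nn as nn

class ConceptWordDetector(nn.Module):
    """Hypothetical concept word detector head: video frames in, word scores out."""
    def __init__(self, feat_dim: int = 2048, vocab_size: int = 300):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, 512, batch_first=True)
        self.classifier = nn.Linear(512, vocab_size)  # one logit per concept word

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim) per-frame CNN features
        _, (h, _) = self.rnn(frames)
        return self.classifier(h[-1])  # (batch, vocab_size) multi-label logits

detector = ConceptWordDetector()
video = torch.randn(2, 40, 2048)           # 2 clips, 40 frames each (toy input)
probs = torch.sigmoid(detector(video))     # multi-label word probabilities
top_words = probs.topk(k=10, dim=-1).indices  # detected concept words as priors
# End-to-end joint training, as the abstract claims, would add a multi-label
# loss (e.g. BCEWithLogitsLoss) on these logits to the downstream language
# model's loss and backpropagate through both.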

Cited by 202 publications (136 citation statements) | References 29 publications
“…against prior work that directly uses MSR-VTT for training (reproduced in [63]) in Table 6. Our off-the-shelf HowTo100M model outperforms [22,24,53,64,65] that are directly trained on MSR-VTT. Here again, after fine-tuning the HowTo100M pre-trained model on MSR-VTT, we observe a significant improvement over the state-of-the-art JSFusion [63] trained on MSR-VTT.…”
Section: Comparison With State-of-the-art (mentioning)
Confidence: 96%
“…VideoQA is considered to be a challenging problem as reasoning on video clip usually requires memorizing contextual information in temporal scale. Many models have been proposed to tackle this problem [5, 10, 27, 30-32]. Many work [5, 10, 30] utilized both motion (i.e.…”
Section: Related Work (mentioning)
Confidence: 99%
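For readers unfamiliar with the two-stream setup this quote alludes to, here is a minimal hedged sketch: appearance and motion features are fused per frame and summarized by a recurrent layer so temporal context is retained for the QA head. All names and dimensions (TwoStreamVideoEncoder, app_dim, mot_dim) are illustrative assumptions, not taken from any cited model.

import torch
import torch.nn as nn

class TwoStreamVideoEncoder(nn.Module):
    """Illustrative fusion of appearance and motion streams for VideoQA."""
    def __init__(self, app_dim: int = 2048, mot_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.temporal = nn.GRU(app_dim + mot_dim, hidden, batch_first=True)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # appearance: (B, T, app_dim), motion: (B, T, mot_dim), aligned per frame
        x = torch.cat([appearance, motion], dim=-1)
        _, h = self.temporal(x)    # final state carries the temporal context
        return h.squeeze(0)        # (B, hidden) clip representation for a QA head

enc = TwoStreamVideoEncoder()
clip_repr = enc(torch.randn(4, 30, 2048), torch.randn(4, 30, 1024))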
“…Recursive neural networks are investigated in [34] for vectorizing subject-verb-object triplets extracted from a given sentence. Variants of recurrent neural networks are being exploited, see the usage of LSTM, bidirectional LSTM, and Gated Recurrent Unit (GRU) in [37], [36], and [24], respectively. To the best of our knowledge, [7] is the only work looking to a joint use of multiple sentence encoding strategies including bag-of-words, word2vec and GRU.…”
Section: Related Work (mentioning)
Confidence: 99%
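To make the "joint use of multiple sentence encoding strategies" concrete, the following sketch concatenates a binary bag-of-words vector with the final state of a GRU run over learned word embeddings (the embeddings standing in for the word2vec component). This is an illustrative assumption about how such a combination could look, not the actual code of [7].

import torch
import torch.nn as nn

class MultiEncodingSentence(nn.Module):
    """Combines bag-of-words and GRU encodings of one sentence (illustrative)."""
    def __init__(self, vocab_size: int = 10000, emb_dim: int = 300, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.vocab_size = vocab_size

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, L) integer word indices
        bow = torch.zeros(token_ids.size(0), self.vocab_size,
                          device=token_ids.device)
        bow.scatter_(1, token_ids, 1.0)    # binary bag-of-words vector
        emb = self.embed(token_ids)        # (B, L, emb_dim) word embeddings
        _, h = self.gru(emb)               # (1, B, hidden) sequential encoding
        return torch.cat([bow, h.squeeze(0)], dim=-1)  # joint representation

enc = MultiEncodingSentence()
joint = enc(torch.randint(0, 10000, (2, 12)))  # -> shape (2, 10000 + 512)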