A Joint Sequence Fusion Model for Video Question Answering and Retrieval

Yu, Yang; Kim, Jong-Seok; Kim, Gun-Hee

doi:10.1007/978-3-030-01234-2_29

Cited by 283 publications

(250 citation statements)

References 40 publications

Supporting

Mentioning

240

Contrasting

Order By: Relevance

“…Furthermore, fine-tuning our model pre-trained on HowTo100M on YouCook2 results in a significant improvement of 13.7 % in R@10 against [25]. against prior work that directly uses MSR-VTT for training (reproduced in [63]) in Table 6. Our off-the-shelf HowTo100M model outperforms [22,24,53,64,65] that are directly trained on MSR-VTT.…”

Section: Comparison With State-of-the-artmentioning

confidence: 77%

“…It contains 200k unique video clip-caption pairs, all annotated by paid human workers. We evaluate our model on the MSR-VTT clip retrieval test set used in [63] as performance of several other methods is reported on it.…”

Section: Datasets and Evaluation Setupsmentioning

confidence: 99%

“…This dataset is even more challenging as movie clips are quite distinct from HowTo100M videos. We compare against several other prior works that have been reproduced in [63] and are trained directly on LSMDC. Here again, we see that pre-training our model on HowTo100M and fine-tuning it on LSMDC also provides improvements upon a model directly trained on LSMDC.…”

Section: Comparison With State-of-the-artmentioning

confidence: 99%

See 2 more Smart Citations

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Miech¹,

Zhukov²,

Alayrac³

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

739

873

View full text Add to dashboard Cite

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-ofthe-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic Youtube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models are publicly available [1].

show abstract

Section: Comparison With State-of-the-artmentioning

confidence: 77%

Section: Datasets and Evaluation Setupsmentioning

confidence: 99%

Section: Comparison With State-of-the-artmentioning

confidence: 99%

See 1 more Smart Citation

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Miech¹,

Zhukov²,

Alayrac³

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

739

873

View full text Add to dashboard Cite

show abstract

“…Humans have an innate cognitive ability to infer from different sensory inputs to answer questions of 5W's and 1H involving who, what, when, where, why and how, and it has been a quest of mankind to duplicate this ability on machines. In recent years, studies on question answering (QA) have successfully benefited from deep neural networks, and showed remarkable performance improvement on textQA [24,30], imageQA [2,3,19,31], videoQA [8,11,32,34]. This paper considers movie story QA [15,18,21,26,29] that aims at a joint understanding of vision and language by answering questions about movie contents and storyline after observing temporally-aligned video and subtitle.…”

Section: Introductionmentioning

confidence: 99%

Progressive Attention Memory Network for Movie Story Question Answering

Kim

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

View full text Add to dashboard Cite

This paper proposes the progressive attention memory network (PAMN) for movie story question answering (QA). Movie story QA is challenging compared to VQA in two aspects: (1) pinpointing the temporal parts relevant to answer the question is difficult as the movies are typically longer than an hour, (2) it has both video and subtitle where different questions require different modality to infer the answer. To overcome these challenges, PAMN involves three main features: (1) progressive attention mechanism that utilizes cues from both question and answer to progressively prune out irrelevant temporal parts in memory, (2) dynamic modality fusion that adaptively determines the contribution of each modality for answering the current question, and (3) belief correction answering scheme that successively corrects the prediction score on each candidate answer. Experiments on publicly available benchmark datasets, MovieQA and TVQA, demonstrate that each feature contributes to our movie story QA architecture, PAMN, and improves performance to achieve the state-of-the-art result. Qualitative analysis by visualizing the inference mechanism of PAMN is also provided.

show abstract

“…On the other hand, in the more sophisticated Video Retrieval where visual attributes, audio features, and narration text content are coupled, with standard protocol disappointingly absent [21], [22], partly due to the lack of high-quality training dataset and supporting information for queries [23]. Popular techniques such as deep learning and agents networks are often deployed to effect improvements in performance [24]- [27], and existing methods sometimes involve fusing distinct categories of information together through feature learning [28]. Likewise, research in the newly-emerged crossmedia retrieval attempts to project heterogeneous features into a common latent feature space to facilitate similarity computation [29]- [32].…”

Section: Introduction and Related Workmentioning

confidence: 99%

Analysis of Evolutionary Behavior in Self-Learning Media Search Engines

Kuang

Clement

2019

2019 IEEE International Conference on Big Data (Big Data)

View full text Add to dashboard Cite

The diversity of intrinsic qualities of multimedia entities tends to impede their effective retrieval. In a Self-Learning Search Engine architecture, the subtle nuances of human perceptions and deep knowledge are taught and captured through unsupervised reinforcement learning, where the degree of reinforcement may be suitably calibrated. Such architectural paradigm enables indexes to evolve naturally while accommodating the dynamic changes of user interests. It operates by continuously constructing indexes over time, while injecting progressive improvement in search performance. For search operations to be effective, convergence of index learning is of crucial importance to ensure efficiency and robustness. In this paper, we develop a Self-Learning Search Engine architecture based on reinforcement learning using a Markov Decision Process framework. The balance between exploration and exploitation is achieved through evolutionary exploration Strategies. The evolutionary index learning behavior is then studied and formulated using stochastic analysis. Experimental results are presented which corroborate the steady convergence of the index evolution mechanism.

show abstract

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

Cited by 283 publications

References 40 publications

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Progressive Attention Memory Network for Movie Story Question Answering

Analysis of Evolutionary Behavior in Self-Learning Media Search Engines

Contact Info

Product

Resources

About