2018
DOI: 10.1007/978-3-030-01234-2_29
A Joint Sequence Fusion Model for Video Question Answering and Retrieval

Abstract: We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pairs of multimodal sequence data (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of two sequence data into a 3D tensor. Then, the Convolutional Hierarchical Decoder computes their similarity score by discovering hidden hierarchical matches between the two sequence mod…
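The abstract names two components: a Joint Semantic Tensor built from dense pairwise video-word interactions, and a Convolutional Hierarchical Decoder that reduces that tensor to a similarity score. Below is a minimal sketch of that idea, not the authors' implementation; the feature dimensions, layer sizes, and the element-wise-product fusion are illustrative assumptions.

```python
# Minimal sketch (assumed dimensions, not the paper's architecture) of a
# pairwise video-word "joint semantic tensor" followed by a convolutional
# decoder that outputs a scalar similarity per (video, sentence) pair.
import torch
import torch.nn as nn


class JointSequenceFusionSketch(nn.Module):
    def __init__(self, video_dim=2048, word_dim=300, joint_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)
        self.word_proj = nn.Linear(word_dim, joint_dim)
        # Convolutional decoder over the frame x word matching grid,
        # ending in a single similarity score.
        self.decoder = nn.Sequential(
            nn.Conv2d(joint_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, 1),
        )

    def forward(self, video_feats, word_feats):
        # video_feats: (B, T, video_dim) frame-level features
        # word_feats:  (B, L, word_dim) word embeddings
        v = self.video_proj(video_feats)          # (B, T, D)
        w = self.word_proj(word_feats)            # (B, L, D)
        # Dense pairwise (frame, word) interactions -> (B, T, L, D),
        # realised here as an element-wise product of all pairs.
        joint = v.unsqueeze(2) * w.unsqueeze(1)
        joint = joint.permute(0, 3, 1, 2)         # (B, D, T, L) for Conv2d
        return self.decoder(joint).squeeze(-1)    # (B,) similarity scores


# Usage: score 2 clips (20 frames each) against 2 sentences (12 words each).
model = JointSequenceFusionSketch()
scores = model(torch.randn(2, 20, 2048), torch.randn(2, 12, 300))
print(scores.shape)  # torch.Size([2])
```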

Cited by 283 publications (250 citation statements)
References 40 publications
“…Furthermore, fine-tuning our model pre-trained on HowTo100M on YouCook2 results in a significant improvement of 13.7% in R@10 against [25]. We also compare against prior work that directly uses MSR-VTT for training (reproduced in [63]) in Table 6. Our off-the-shelf HowTo100M model outperforms [22,24,53,64,65], which are directly trained on MSR-VTT.…”
Section: Comparison With State-of-the-art
confidence: 77%
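The excerpt above reports retrieval quality as R@10, i.e. Recall@K. A short, generic sketch of how that metric is computed from a query-candidate similarity matrix; this is the standard definition, not code from the cited papers.

```python
# Recall@K for retrieval: fraction of queries whose ground-truth item
# appears among the K highest-scoring candidates.
import numpy as np


def recall_at_k(sim, k=10):
    """sim[i, j] = similarity of query i to candidate j; the correct
    candidate for query i is assumed to sit at index i (diagonal)."""
    # Rank candidates for each query from most to least similar.
    ranking = np.argsort(-sim, axis=1)
    # A hit if the ground-truth index appears within the top-k results.
    hits = (ranking[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()


# Usage: 100 text queries against 100 video clips with random scores.
sim = np.random.rand(100, 100)
print(f"R@10: {recall_at_k(sim, k=10):.3f}")
```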
“…It contains 200k unique video clip-caption pairs, all annotated by paid human workers. We evaluate our model on the MSR-VTT clip retrieval test set used in [63], as the performance of several other methods is reported on it.…”
Section: Datasets and Evaluation Setups
confidence: 99%
“…Humans have an innate cognitive ability to infer from different sensory inputs to answer questions of the 5W's and 1H (who, what, when, where, why, and how), and it has been a quest of mankind to duplicate this ability on machines. In recent years, studies on question answering (QA) have benefited greatly from deep neural networks and have shown remarkable performance improvements on textQA [24,30], imageQA [2,3,19,31], and videoQA [8,11,32,34]. This paper considers movie story QA [15,18,21,26,29], which aims at a joint understanding of vision and language by answering questions about movie content and storyline after observing temporally aligned video and subtitles.…”
Section: Introduction
confidence: 99%
“…On the other hand, in the more sophisticated task of video retrieval, where visual attributes, audio features, and narration text content are coupled, a standard protocol is disappointingly absent [21], [22], partly due to the lack of high-quality training datasets and supporting information for queries [23]. Popular techniques such as deep learning and agent networks are often deployed to improve performance [24]-[27], and existing methods sometimes fuse distinct categories of information through feature learning [28]. Likewise, research in the newly emerged cross-media retrieval attempts to project heterogeneous features into a common latent feature space to facilitate similarity computation [29]-[32].…”
Section: Introduction and Related Work
confidence: 99%
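The last excerpt describes projecting heterogeneous features into a common latent space for similarity computation. A minimal sketch of that general idea with two linear projection heads and cosine similarity; the dimensions and layer choices are illustrative assumptions, not taken from the cited works.

```python
# Two modality-specific projections into a shared, L2-normalised space,
# so a dot product between projected features equals cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CommonSpaceProjector(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, latent_dim=512):
        super().__init__()
        self.video_head = nn.Linear(video_dim, latent_dim)
        self.text_head = nn.Linear(text_dim, latent_dim)

    def forward(self, video_feats, text_feats):
        # Project each modality into the shared space and normalise.
        v = F.normalize(self.video_head(video_feats), dim=-1)
        t = F.normalize(self.text_head(text_feats), dim=-1)
        return v @ t.T  # (num_videos, num_texts) similarity matrix


# Usage: similarity matrix between 8 video and 8 text feature vectors.
model = CommonSpaceProjector()
sims = model(torch.randn(8, 2048), torch.randn(8, 768))
print(sims.shape)  # torch.Size([8, 8])
```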