Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413649

Dual Hierarchical Temporal Convolutional Network with QA-Aware Dynamic Normalization for Video Story Question Answering

Abstract: Video story question answering (video story QA) is a challenging problem, as it requires a joint understanding of diverse data sources (i.e., video, subtitle, question, and answer choices). Existing approaches for video story QA have several common defects: (1) single temporal scale; (2) static and rough multimodal interaction; and (3) insufficient (or shallow) exploitation of both question and answer choices. In this paper, we propose a novel framework named Dual Hierarchical Temporal Convolutional Network (D…

Cited by 6 publications (2 citation statements)
References: 42 publications
“…Following previous studies (Li et al 2019b; Huang et al 2020; Fan et al 2019; Le et al 2020; Park, Lee, and Sohn 2021), different decoding methods are used according to the types of question. Specifically, we approach an open-ended question as a multi-class classification task, where the answering decoder aims to predict the correct class from the answer space A.…”
Section: Answer Decoding and Loss Computation (mentioning)
confidence: 99%
“…Most existing methods (Lei et al 2018; Gao et al 2019; Liu et al 2020; Cai et al 2020; Jiang and Han 2020) use recurrent neural networks (RNNs) and their … [Figure 1: The interaction of the question and the visual content usually happens at multiple temporal scales, as illustrated by the connected pairs of different parts of the question and frames at different levels of the temporal pyramid.] …”
Section: Introduction (mentioning)
confidence: 99%