Dual Hierarchical Temporal Convolutional Network with QA-Aware Dynamic Normalization for Video Story Question Answering

Liu, Fei; Liu, Jing; Zhu, Xinxin; Hong, Richang; Lu, Hanqing

doi:10.1145/3394171.3413649

Cited by 6 publications

(2 citation statements)

References 42 publications

“…Following previous studies (Li et al 2019b; Huang et al 2020; Fan et al 2019; Le et al 2020; Park, Lee, and Sohn 2021), different decoding methods are used according to the types of question. Specifically, we approach an open-ended question as a multi-class classification task, where the answering decoder aims to predict the correct class from the answer space A.…”

Section: Answer Decoding and Loss Computation (mentioning)

confidence: 99%

“…Most existing methods (Lei et al 2018; Gao et al 2019; Liu et al 2020; Cai et al 2020; Jiang and Han 2020) use recurrent neural networks (RNNs) and their … [Figure 1: The interaction of the question and the visual content usually happens at multiple temporal scales, as illustrated by the connected pairs of different parts of the question and frames at different levels of the temporal pyramid.]…”

Section: Introduction (mentioning)

confidence: 99%

Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering

Peng¹, Wang², Gao³ et al. 2021

Preprint

Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language understanding. Existing approaches seldom leverage the appearance-motion information in the video at multiple temporal scales, and they frequently ignore the interaction between the question and the visual information when extracting textual semantics. Targeting these issues, this paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA. The TPT model comprises two modules, namely Question-specific Transformer (QT) and Visual Inference (VI). Given the temporal pyramid constructed from a video, QT builds the question semantics from the coarse-to-fine multimodal co-occurrence between each word and the visual content. Under the guidance of such question-specific semantics, VI infers the visual clues from the local-to-global multi-level interactions between the question and the video. Within each module, we introduce a multimodal attention mechanism to aid the extraction of question-video interactions, with residual connections adopted for passing information across different levels. Through extensive experiments on three VideoQA datasets, we demonstrate that the proposed method outperforms state-of-the-art methods. Code is available at https://github.com/Trunpm/TPT-for-VideoQA

* Equal contribution. Under review.
