Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering

Dang, Long Hoang; Le, Thao Minh; Le, Vuong; Tran, Truyen

doi:10.24963/ijcai.2021/88

Cited by 39 publications

(17 citation statements)

References 30 publications


“…Recently, [Kim et al, 2019] proposed a multistep progressive attention model to prune out irrelevant temporal segments, and a memory network to progressively update the cues to answer. Additionally, some proposed to leverage object detection across the video frames to acquire fine-grained appearance-question interactions [Dang et al, 2021;. [Le et al, 2020] proposed to use a hierarchical structure for the extraction of question-video interactions from the frame-level and segment-level.…”

Section: Related Work (mentioning)

confidence: 99%

“…Different from some previous VideoQA works [Dang et al, 2021; Fan et al, 2019; Le et al, 2020; Park et al, 2021] that adopted dense sampling for the input video, we conduct a multiscale sampling to help acquire visual features at different temporal scales. For input video V, at scale n ∈ {1, ..., N}, we sample T × 2^{n−1} frames along the forward temporal direction, with T as the size of our sampling window, which is set to 16 in our experiment.…”

Section: Multiscale Sampling and Feature Extraction (mentioning)

confidence: 99%
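
The sampling rule quoted above can be checked with a few lines of arithmetic. The sketch below is not the citing authors' code: the function names are hypothetical, and the evenly spaced index selection is an assumption, since the excerpt only says frames are sampled "along the forward temporal direction". It computes T × 2^{n−1} frames per scale and groups them into clips of T frames each (also an assumption).

```python
def frames_per_scale(window: int, num_scales: int) -> list[int]:
    """Number of frames sampled at each scale n = 1..N, i.e. T * 2^(n-1)."""
    return [window * 2 ** (n - 1) for n in range(1, num_scales + 1)]


def uniform_indices(video_len: int, num_frames: int) -> list[int]:
    """Evenly spaced frame indices in forward temporal order.

    Assumption: the excerpt does not state how frames are picked, so this
    uses plain uniform spacing. Indices repeat if the video is shorter
    than the number of frames requested.
    """
    return [min(video_len - 1, (i * video_len) // num_frames)
            for i in range(num_frames)]


T = 16  # sampling window size quoted in the excerpt
N = 3   # number of scales quoted in a later excerpt

counts = frames_per_scale(T, N)
clips = [c // T for c in counts]  # assumed: each clip holds T frames

print(counts)      # frames per scale
print(sum(clips))  # total clips across scales
```

With T = 16 and N = 3 this gives 16, 32 and 64 frames, or 1 + 2 + 4 = 7 clips in total, which agrees with the "only samples 7 clips" figure quoted in the comparison excerpt.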

“…In addition, most methods use dense sampling for the input video, e.g., HCRN [Le et al, 2020] and Bridge2Answer [Park et al, 2021] sampled 8 clips each comprising 16 frames, while our method with scale N set to 3 only samples 7 clips so that costing less computational loads.

… [Fan et al, 2019] | 73.9 | 77.8 | 53.8 | 4.02
FAM [Cai et al, 2020] | 75.4 | 79.2 | 56.9 | 3.79
L-GCN [Huang et al, 2020] | 74.3 | 81.1 | 56.3 | 3.95
HGA [Jiang and Han, 2020] | 75.4 | 81.0 | 55.1 | 4.09
HCRN [Le et al, 2020] | 75.0 | 81.4 | 55.9 | 3.82
Bridge2Answer [Park et al, 2021] | 75.9 | 82.6 | 57.5 | 3.71
HOSTR [Dang et al, 2021] | 75 …

… et al, 2018] | 31.7 | 31.9
HME [Fan et al, 2019] | 33.7 | 33.0
FAM [Cai et al, 2020] | 34.5 | 33.2
HGA [Jiang and Han, 2020] | 34.7 | 35.5
HCRN [Le et al, 2020] | 36.1 | 35.6
Bridge2Answer [Park et al, 2021] | 37.2 | 36.9
HOSTR [Dang et al, 2021] | 39.4 | 35.9

Further comparisons on the MSVD-QA and MSRVTT-QA datasets are conducted. Results are reported in Table 2.…”

Section: Comparison With the State-of-the-arts (mentioning)

confidence: 99%

“…On such more challenging data, our MHN model still achieves the best performances of 40.4% and 38.6% on both datasets, respectively. While Bridge2Answer [Park et al, 2021] additionally extracted semantic dependencies from the question using a NLP tool and HOSTR [Dang et al, 2021] applied Fast R-CNN for object detection per frame, our model is able to produce even higher performances without such complex feature pre-processing.…”

Section: Comparison With the State-of-the-arts (mentioning)

confidence: 99%

“…The majority of existing methods [Dang et al, 2021; Kim et al, 2019; Le et al, 2020; …

Figure 1: (a) The multiscale property of a video example, where at a fine-grained scale the richer frames contribute to understanding general action and logical information, and the local attributes could be better inferred with fewer frames at a coarser scale. (b) The typical multilevel processing of a deep learning model, where the increase of feature levels leads to the transition of learning from local objects to global semantics.…”

Section: Introduction (mentioning)

confidence: 99%


Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering

Peng, Wang, Gao et al. 2022

Preprint


Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language processing. While most existing approaches ignore the visual appearance-motion information at different temporal scales, it is unknown how to incorporate the multilevel processing capacity of a deep learning model with such multiscale information. Targeting these issues, this paper proposes a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR). With a multiscale sampling, RMI iterates the interaction of appearance-motion information at each scale and the question embeddings to build the multilevel question-guided visual representations. Thereon, with a shared transformer encoder, PVR infers the visual cues at each level in parallel to fit with answering different question types that may rely on the visual information at relevant levels. Through extensive experiments on three VideoQA datasets, we demonstrate improved performances than previous state-of-the-arts and justify the effectiveness of each part of our method.


Video Dialog as Conversation About Objects Living in Space-Time

Pham et al. 2022

Lecture Notes in Computer Science


Video Graph Transformer for Video Question Answering

Xiao, Zhou, Chua et al. 2022

Lecture Notes in Computer Science


We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of multi-modal transformer for answer classification. Fine-grained video-text communication is done by additional cross-modal interaction modules. 3) It is optimized by the joint fully-and self-supervised contrastive objectives between the correct and incorrect answers, as well as the relevant and irrelevant questions respectively. With superior video encoding and QA solution, we show that CoVGT can achieve much better performances than previous arts on video reasoning tasks. Its performances even surpass those models that are pretrained with millions of external data. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude smaller data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning of video contents. Our code will be available at https://github.com/doc-doc/CoVGT.
