Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence 2021
DOI: 10.24963/ijcai.2021/88
Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering

Abstract: Video Question Answering (Video QA) is a powerful testbed for developing new AI capabilities. The task necessitates learning to reason about objects, relations, and events across the visual and linguistic domains in space-time. High-level reasoning demands lifting from associative visual pattern recognition to symbol-like manipulation over objects, their behavior, and their interactions. Toward this goal we propose an object-oriented reasoning approach in which video is abstracted as a dynamic stream of interacting …

Cited by 39 publications (17 citation statements). References 30 publications.
“…Recently, [Kim et al., 2019] proposed a multistep progressive attention model to prune out irrelevant temporal segments, and a memory network to progressively update the cues to answer. Additionally, some works proposed to leverage object detection across the video frames to acquire fine-grained appearance-question interactions [Dang et al., 2021; …]. [Le et al., 2020] proposed a hierarchical structure for extracting question-video interactions at the frame and segment levels.…”
Section: Related Work
“…Different from some previous VideoQA works [Dang et al., 2021; Fan et al., 2019; Le et al., 2020; Park et al., 2021] that adopted dense sampling for the input video, we conduct multiscale sampling to acquire visual features at different temporal scales. For an input video V, at scale n ∈ {1, ..., N}, we sample T × 2^(n−1) frames along the forward temporal direction, where T is the size of our sampling window, set to 16 in our experiments.…”
Section: Multiscale Sampling and Feature Extraction
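The multiscale sampling rule quoted above can be sketched in a few lines. This is an illustrative reconstruction only: the function name and the evenly-spaced index strategy are assumptions, not the cited authors' implementation; the quote specifies only the per-scale frame count T × 2^(n−1).

```python
def multiscale_sample_indices(num_frames: int, T: int = 16, N: int = 3):
    """Per scale n = 1..N, return evenly spaced frame indices of size T * 2**(n-1).

    Hypothetical sketch of the sampling rule quoted in the citation statement;
    the uniform spacing is an assumption, only the counts come from the text.
    """
    scales = []
    for n in range(1, N + 1):
        count = T * 2 ** (n - 1)  # scale n samples T * 2^(n-1) frames
        # Evenly spaced indices along the forward temporal direction,
        # clamped so we never index past the last available frame.
        indices = [min(int(i * num_frames / count), num_frames - 1)
                   for i in range(count)]
        scales.append(indices)
    return scales
```

With T = 16 and N = 3 this yields index lists of sizes 16, 32, and 64, giving progressively finer temporal coverage of the same clip.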