2022
DOI: 10.1609/aaai.v36i3.20184

Video as Conditional Graph Hierarchy for Multi-Granular Question Answering

Abstract: Video question answering requires models to understand and reason over both complex video and language data to correctly derive answers. Existing efforts have focused on designing sophisticated cross-modal interactions that fuse information from the two modalities, while encoding the video and question holistically as frame and word sequences. Despite their success, these methods essentially revolve around the sequential nature of video and question contents, providing little insight into …
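
The abstract is cut off, so the sketch below is only a loose illustration of the multi-granular idea the title names: pooling object-level features up to frame, clip, and video levels under the guidance of the question. Every name here (condition_on_question, the three-level layout, attention pooling) is an assumption for illustration, not the paper's actual model.

```python
# Hypothetical sketch of a multi-granular video hierarchy conditioned on a
# question embedding. Level names and the pooling scheme are assumptions.
import torch
import torch.nn.functional as F

def condition_on_question(feats, q):
    # feats: (n, d) features at one granularity; q: (d,) question embedding.
    # Question-guided attention pooling: weight each unit by its relevance
    # to the question, then aggregate into one coarser-level vector.
    scores = feats @ q                       # (n,) relevance scores
    weights = F.softmax(scores, dim=0)       # attention over units
    return weights.unsqueeze(0) @ feats      # (1, d) pooled representation

def video_summary(obj_feats, frame_ids, clip_ids, q):
    # obj_feats: (num_objects, d); frame_ids: (num_objects,) maps objects to
    # frames; clip_ids: (num_frames,) maps frames (in sorted-id order) to clips.
    frames = torch.cat([condition_on_question(obj_feats[frame_ids == f], q)
                        for f in frame_ids.unique()])        # (num_frames, d)
    clips = torch.cat([condition_on_question(frames[clip_ids == c], q)
                       for c in clip_ids.unique()])          # (num_clips, d)
    return condition_on_question(clips, q)   # (1, d) question-aware summary
```
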

Cited by 63 publications (61 citation statements)
References 36 publications

“…Recently, graphs constructed over object-level representations [19,36,47,60] have demonstrated superior performance, especially on benchmarks that emphasize visual relation reasoning [20,49,50,59]. However, these graph methods either construct monolithic graphs that do not disambiguate between relations 1) in space and time and 2) in local and global scopes [19,57], or build static graphs at the frame level without explicitly capturing the temporal dynamics [36,42,60]. The monolithic graph is cumbersome for long videos where multiple objects interact in space-time.…”
Section: Related Work
confidence: 99%
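
The distinction drawn above, between edges in space versus time and between local and global scopes, is easy to make concrete. Below is a minimal sketch of a space-time object graph with explicitly typed spatial and temporal edges; the fully connected wiring and the STGraph/build_st_graph names are assumptions for illustration, not the construction used by any of the cited methods.

```python
# Hypothetical sketch: a space-time object graph whose edges are explicitly
# typed as spatial (within a frame) or temporal (across adjacent frames),
# avoiding the "monolithic graph" ambiguity criticized above.
from dataclasses import dataclass, field

@dataclass
class STGraph:
    nodes: list = field(default_factory=list)   # (frame_idx, box, feature)
    edges: list = field(default_factory=list)   # (src, dst, kind)

def build_st_graph(frames):
    """frames: list of per-frame object lists, each object a (box, feature)."""
    g, offsets = STGraph(), []
    for t, objs in enumerate(frames):
        offsets.append(len(g.nodes))
        g.nodes += [(t, box, feat) for box, feat in objs]
        # Spatial edges: fully connect objects within the same frame.
        for i in range(len(objs)):
            for j in range(i + 1, len(objs)):
                g.edges.append((offsets[t] + i, offsets[t] + j, "spatial"))
        # Temporal edges: link each object to every object in the previous
        # frame (a tracker or IoU threshold would prune these in practice).
        if t > 0:
            for i in range(len(frames[t - 1])):
                for j in range(len(objs)):
                    g.edges.append((offsets[t - 1] + i, offsets[t] + j, "temporal"))
    return g
```
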
“…For pretraining with weakly-paired video-text data, we adopt cross-modal matching as the major proxy task and optimize the model in a contrastive manner [44] along with masked language modelling [11]. Given a video, we sparsely sample l_v frames in a way analogous to [60].…”
Section: Overview
confidence: 99%
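
For readers unfamiliar with the two ingredients named in this statement, the sketch below shows one standard reading: uniform sparse sampling of l_v frames and an InfoNCE-style contrastive matching loss over paired video and text embeddings. The uniform sampling scheme, the symmetric loss, and the temperature value are assumptions; the citing paper's exact recipe may differ.

```python
import torch
import torch.nn.functional as F

def sparse_sample(num_frames, l_v):
    # Uniformly pick l_v frame indices from a video of num_frames frames
    # (one common reading of "sparsely sample l_v frames"; an assumption).
    return torch.linspace(0, num_frames - 1, l_v).long()

def contrastive_matching_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, d) paired embeddings. InfoNCE over the
    # batch: each video should match its own caption, and vice versa.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature               # (batch, batch) similarities
    targets = torch.arange(v.size(0))            # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```
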
“…bounding boxes in Figure 1 (a)). By describing activities with a verb and grounded semantic roles, GSR provides a visually grounded structured representation (named a verb frame) for the activity, which benefits many downstream scene-understanding tasks, such as image-text retrieval (Gordo et al. 2016; Noh et al. 2017), image captioning (Mallya and Lazebnik 2017; Chen et al. 2017, 2021a), visual grounding (Chen et al. 2021b), and VQA (Cadene et al. 2019; Chen et al. 2020, 2021c; Xiao et al. 2022).…”
Section: Introduction
confidence: 99%
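
As an illustration of the "verb frame" structure this statement describes, here is a minimal, hypothetical encoding of a verb with grounded semantic roles; the role vocabulary, box format, and class names are assumptions for illustration, not the actual schema of any GSR dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GroundedRole:
    role: str                     # semantic role, e.g. "agent"
    noun: str                     # role filler, e.g. "person"
    box: Optional[tuple] = None   # (x1, y1, x2, y2) grounding; None if unseen

@dataclass
class VerbFrame:
    verb: str                     # activity verb, e.g. "riding"
    roles: list                   # list of GroundedRole entries

# Example: "a person riding a horse on a beach", with two grounded roles.
frame = VerbFrame(
    verb="riding",
    roles=[GroundedRole("agent", "person", (12, 30, 88, 200)),
           GroundedRole("vehicle", "horse", (40, 60, 260, 300)),
           GroundedRole("place", "beach")],
)
```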