“…This advancement stems in part from the success of multi-modal pretraining on web-scale vision-text data [8,21,31,34,38,44,52,53,54,63], and in part from the emergence of a unified deep neural network, the transformer [55], that can model both vision and natural language data well. As a typical multi-disciplinary AI task, Video Question Answering (VideoQA) has benefited greatly from these developments, which have propelled the field steadily forward beyond purely conventional techniques [14,16,20,23,28,60,71].…”