“…In contrast, although the concurrent work HGA [Jiang and Han, 2020] and the more recent B2A [Park et al, 2021] and DualVGR build their graphs over coarser-grained video elements and words, they incorporate both intra-modal and inter-modal relationship learning and achieve better performance. Considering that video elements are hierarchical in semantic space, several works, including [Liu et al, 2021a] and [Peng et al, 2021], separately incorporate the idea of hierarchical learning into graph networks. Specifically, [Liu et al, 2021a] propose a graph memory mechanism (HAIR) to perform relational vision-semantic reasoning from the object level to the frame level; [Peng et al, 2021] concatenate graphs at different levels, i.e., object-level, frame-level, and clip-level, in a progressive manner to learn visual relations (PGAT); and HQGA, a hierarchical conditional graph model, weaves together visual facts from low-level entities to higher-level video elements through graph aggregation and pooling, enabling vision-text matching at multiple granularity levels.…”
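To make the hierarchical idea concrete, the sketch below (in PyTorch) illustrates the general aggregate-then-pool pattern these models share: nodes at one granularity are related by a small graph layer, pooled into fewer higher-level nodes, and conditioned on the question at each level. Every module name, dimension, and the gating-based question conditioning here is an illustrative assumption, not the published implementation of HAIR, PGAT, or HQGA.

```python
# Minimal sketch (not the authors' code) of hierarchical graph
# aggregation and pooling for VideoQA: relate object nodes within each
# frame, pool them into frame nodes, relate frames, pool to a video
# vector. All names, sizes, and the gating scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph-convolution step with a similarity-based adjacency."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, D)
        # Dense, row-normalized adjacency from pairwise similarity;
        # papers often use attention or k-NN graphs instead.
        adj = F.softmax(x @ x.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        return F.relu(adj @ self.proj(x))

class HierarchicalGraphVQA(nn.Module):
    """Object-level graph -> frame nodes -> frame-level graph -> video vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.obj_gcn = GCNLayer(dim)
        self.frame_gcn = GCNLayer(dim)
        self.q_gate = nn.Linear(dim, dim)  # question-conditioned gating

    def condition(self, x, q):
        # Gate node features by the question embedding (an assumption;
        # the cited models use more elaborate conditioning).
        return x * torch.sigmoid(self.q_gate(q)).unsqueeze(1)

    def forward(self, objs, q):
        # objs: (B, T, K, D) = K object features per frame for T frames
        # q:    (B, D)      = pooled question embedding
        B, T, K, D = objs.shape
        o = self.condition(objs.view(B * T, K, D), q.repeat_interleave(T, 0))
        o = self.obj_gcn(o)                # reason among objects in a frame
        frames = o.mean(1).view(B, T, D)   # pool objects -> frame nodes
        f = self.frame_gcn(self.condition(frames, q))  # reason across frames
        return f.mean(1)                   # pool frames -> video vector

# Usage: the video vector can then be matched against answer embeddings.
model = HierarchicalGraphVQA()
video_vec = model(torch.randn(2, 8, 5, 256), torch.randn(2, 256))
print(video_vec.shape)  # torch.Size([2, 256])
```

Mean pooling is used here only for brevity; the works above employ learned aggregation (e.g., attention-based pooling) and richer graph constructions at each level.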