2022
DOI: 10.48550/arxiv.2207.05342
Preprint

Video Graph Transformer for Video Question Answering

Abstract: This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA). VGT's uniqueness is two-fold: 1) it designs a dynamic graph transformer module that encodes video by explicitly capturing the visual objects, their relations, and dynamics for complex spatio-temporal reasoning; and 2) it exploits disentangled video and text Transformers for relevance comparison between the video and text to perform QA, instead of an entangled cross-modal Transformer for answer classification. Vision-…
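
The second point of the abstract, answering by relevance comparison between separately encoded video and text rather than by cross-modal classification, can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings are hand-written placeholder vectors, the function names `dot` and `select_answer` are hypothetical, and the use of a dot product as the relevance score is an assumption made for illustration.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors (assumed relevance score)."""
    return sum(a * b for a, b in zip(u, v))


def select_answer(video_emb: Sequence[float],
                  candidate_embs: List[Sequence[float]]) -> int:
    """Return the index of the candidate whose embedding best matches the video.

    The video and each candidate answer are assumed to have been encoded
    separately (the "disentangled" encoders); only their final embeddings
    are compared here.
    """
    scores = [dot(video_emb, cand) for cand in candidate_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Placeholder embeddings standing in for encoder outputs (hypothetical values).
video_emb = [0.2, 0.9, -0.1]
candidate_embs = [
    [0.1, 0.0, 0.5],   # candidate answer 0
    [0.3, 0.8, 0.0],   # candidate answer 1
    [-0.4, 0.2, 0.1],  # candidate answer 2
]

best = select_answer(video_emb, candidate_embs)
print("selected candidate:", best)
```

Because the two modalities are only compared at the embedding level, the candidate set can change without retraining a fixed-size classification head, which is one plausible motivation for this design.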

References 48 publications