Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.481

Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

Abstract: Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize the…
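The abstract sketches the core idea: appearance and motion features are each grounded in the question, and the question then decides how much each stream contributes. Below is a minimal PyTorch-style sketch of that idea; it illustrates the described mechanism rather than the authors' implementation, and all module names, dimensions, and the gating scheme are assumptions.

# Hedged sketch of the idea in the abstract: two cross-modal streams
# (appearance and motion) are each fused with the question, and a
# question-guided gate selectively weights the two streams.
# Names and dimensions are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAppearanceFusion(nn.Module):
    def __init__(self, vid_dim=2048, q_dim=768, hid_dim=512):
        super().__init__()
        self.app_proj = nn.Linear(vid_dim, hid_dim)   # appearance stream
        self.mot_proj = nn.Linear(vid_dim, hid_dim)   # motion stream
        self.q_proj = nn.Linear(q_dim, hid_dim)
        # question-conditioned gate over the two streams
        self.gate = nn.Linear(hid_dim, 2)

    def forward(self, app_feats, mot_feats, q_feat):
        # app_feats, mot_feats: (B, T, vid_dim); q_feat: (B, q_dim)
        q = self.q_proj(q_feat)                     # (B, H)
        app = self.app_proj(app_feats)              # (B, T, H)
        mot = self.mot_proj(mot_feats)              # (B, T, H)

        # cross-modal grounding: the question attends over each visual stream
        app_ctx = self._attend(q, app)              # (B, H)
        mot_ctx = self._attend(q, mot)              # (B, H)

        # selective utilization: question-dependent weights for the two streams
        w = F.softmax(self.gate(q), dim=-1)         # (B, 2)
        return w[:, :1] * app_ctx + w[:, 1:] * mot_ctx

    @staticmethod
    def _attend(query, keys):
        # scaled dot-product attention of a single query over a sequence of keys
        scores = torch.einsum('bh,bth->bt', query, keys) / keys.size(-1) ** 0.5
        alpha = F.softmax(scores, dim=-1)
        return torch.einsum('bt,bth->bh', alpha, keys)

The gate is conditioned only on the question here, which is one plausible way to realize "selectively utilize" the two streams depending on what the question asks about.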

Cited by 36 publications (15 citation statements). References 38 publications (31 reference statements).

Citation statements (ordered by relevance):
“…Cross-modal pretraining seems promising [29,67,70]. Yet, it requires handling prohibitively large-scale video-text data [15,70]; otherwise, the performance is still inferior to state-of-the-art (SoTA) conventional techniques [29,47,67]. In this work, we reveal two major reasons accounting for the failure: 1) Video encoders are overly simplistic.…”
Section: Introduction (mentioning)
confidence: 91%
“…Yet, most of them leverage frame- or clip-level video representations as the information source. Recently, graphs constructed over object-level representations [19,36,47,60] have demonstrated superior performance, especially on benchmarks that emphasize visual relation reasoning [20,49,50,59]. However, these graph methods either construct monolithic graphs that do not disambiguate between relations in 1) space and time, and 2) local and global scopes [19,57], or build static graphs at the frame level without explicitly capturing the temporal dynamics [36,42,60].…”
Section: Related Work (mentioning)
confidence: 99%
“…Various approaches to combining spatial image representations and sequential question representations have been proposed [6], [11], [30], [66], [99], [101], [106]. More specifically, in the video domain (VideoQA), spatio-temporal video representations in terms of motion and appearance have been used in [23], [28], [32], [38], [41], [42], [43], [50], [51], [52], [58], [72], [79], [100], [102], [109], [114], [122].…”
Section: Related Work (mentioning)
confidence: 99%
“…Some studies have attempted to capture more fine-grained visual-language correlations. MASN (Seo et al., 2021) introduces frame-level and clip-level modules to simultaneously model the correlation between visual information and the question at different levels. RHA (Li et al., 2021) proposes a hierarchical attention network to further model the video subtitle-question correlation.…”
Section: Related Work (mentioning)
confidence: 99%
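As one reading of the statement above, the frame-level / clip-level idea can be pictured as computing question-visual correlation at two granularities in parallel. The sketch below is illustrative only; it is not MASN's actual code, and the feature shapes and attention form are assumptions.

# Hedged sketch: question-guided attention computed separately over per-frame
# features and per-clip features, keeping both levels for a later fusion step.
import torch
import torch.nn.functional as F

def multi_level_correlation(frame_feats, clip_feats, q_feat):
    # frame_feats: (B, T, D) per-frame features, clip_feats: (B, C, D) per-clip
    # features, q_feat: (B, D) pooled question feature.
    frame_scores = torch.einsum('bd,btd->bt', q_feat, frame_feats)  # frame level
    clip_scores = torch.einsum('bd,bcd->bc', q_feat, clip_feats)    # clip level

    frame_ctx = torch.einsum('bt,btd->bd', F.softmax(frame_scores, -1), frame_feats)
    clip_ctx = torch.einsum('bc,bcd->bd', F.softmax(clip_scores, -1), clip_feats)
    # both levels are returned so a fusion step can weight them by the question
    return frame_ctx, clip_ctx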
“…To capture the visual-language relation, some works utilize bilinear pooling operations or spatio-temporal attention mechanisms to align the video and textual features (Jang et al., 2019; Seo et al., 2021). Other methods use co-attention mechanisms (Jiang and Han, 2020; Li et al., 2021) to align multi-modal features, or employ memory-augmented RNNs (Yin et al., 2020) or graph memory mechanisms to perform relational reasoning in VideoQA.…”
Section: Introduction (mentioning)
confidence: 99%
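The co-attention alignment mentioned in the statement above can be illustrated with a generic bidirectional attention between video and question tokens. The sketch below is a textbook-style illustration under assumed shapes, not the cited authors' implementation.

# Hedged sketch of generic co-attention: an affinity matrix between video and
# question tokens is used to attend in both directions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, video, question):
        # video: (B, T, D) frame/clip features; question: (B, L, D) word features
        A = torch.bmm(self.affinity(video), question.transpose(1, 2))  # (B, T, L)

        # video attended by the question, and question attended by the video
        v2q = torch.bmm(F.softmax(A, dim=2), question)                 # (B, T, D)
        q2v = torch.bmm(F.softmax(A, dim=1).transpose(1, 2), video)    # (B, L, D)
        return v2q, q2v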