Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.481

Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

Abstract: Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize the…
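The abstract sketches the core idea: appearance and motion features are each grounded in the question, and the question then decides how much each stream contributes. Below is a minimal PyTorch-style sketch of that idea; it illustrates the described mechanism rather than the authors' implementation, and all module names, dimensions, and the gating scheme are assumptions.

# Hedged sketch of the idea in the abstract: two cross-modal streams
# (appearance and motion) are each fused with the question, and a
# question-guided gate selectively weights the two streams.
# Names and dimensions are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAppearanceFusion(nn.Module):
    def __init__(self, vid_dim=2048, q_dim=768, hid_dim=512):
        super().__init__()
        self.app_proj = nn.Linear(vid_dim, hid_dim)   # appearance stream
        self.mot_proj = nn.Linear(vid_dim, hid_dim)   # motion stream
        self.q_proj = nn.Linear(q_dim, hid_dim)
        # question-conditioned gate over the two streams
        self.gate = nn.Linear(hid_dim, 2)

    def forward(self, app_feats, mot_feats, q_feat):
        # app_feats, mot_feats: (B, T, vid_dim); q_feat: (B, q_dim)
        q = self.q_proj(q_feat)                     # (B, H)
        app = self.app_proj(app_feats)              # (B, T, H)
        mot = self.mot_proj(mot_feats)              # (B, T, H)

        # cross-modal grounding: the question attends over each visual stream
        app_ctx = self._attend(q, app)              # (B, H)
        mot_ctx = self._attend(q, mot)              # (B, H)

        # selective utilization: question-dependent weights for the two streams
        w = F.softmax(self.gate(q), dim=-1)         # (B, 2)
        return w[:, :1] * app_ctx + w[:, 1:] * mot_ctx

    @staticmethod
    def _attend(query, keys):
        # scaled dot-product attention of a single query over a sequence of keys
        scores = torch.einsum('bh,bth->bt', query, keys) / keys.size(-1) ** 0.5
        alpha = F.softmax(scores, dim=-1)
        return torch.einsum('bt,bth->bh', alpha, keys)

The gate is conditioned only on the question here, which is one plausible way to realize "selectively utilize" the two streams depending on what the question asks about.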

Cited by 36 publications (15 citation statements). References 38 publications (31 reference statements).

Citation statements (ordered by relevance):
“…Cross-modal pretraining seems promising [29,67,70]. Yet, it requires handling prohibitively large-scale video-text data [15,70]; otherwise, the performance is still inferior to state-of-the-art (SoTA) conventional techniques [29,47,67]. In this work, we reveal two major reasons accounting for the failure: 1) Video encoders are overly simplistic.…”
Section: Introduction (mentioning)
confidence: 91%
“…Yet, most of them leverage frame- or clip-level video representations as the information source. Recently, graphs constructed over object-level representations [19,36,47,60] have demonstrated superior performance, especially on benchmarks that emphasize visual relation reasoning [20,49,50,59]. However, these graph methods either construct monolithic graphs that do not disambiguate between relations in 1) space and time, and 2) local and global scopes [19,57], or build static graphs at the frame level without explicitly capturing the temporal dynamics [36,42,60].…”
Section: Related Work (mentioning)
confidence: 99%
“…Various approaches to combining spatial image representations and sequential question representations have been proposed [6], [11], [30], [66], [99], [101], [106]. More specifically, in the video domain (VideoQA), spatio-temporal video representations in terms of motion and appearance have been used in [23], [28], [32], [38], [41], [42], [43], [50], [51], [52], [58], [72], [79], [100], [102], [109], [114], [122].…”
Section: Related Work (mentioning)
confidence: 99%
“…Some studies have attempted to capture more fine-grained visual-language correlations. MASN (Seo et al., 2021) introduces frame-level and clip-level modules to simultaneously model the correlation between visual information and the question at different levels. RHA (Li et al., 2021) proposes a hierarchical attention network to further model the video subtitle-question correlation.…”
Section: Related Work (mentioning)
confidence: 99%
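As one reading of the statement above, the frame-level / clip-level idea can be pictured as computing question-visual correlation at two granularities in parallel. The sketch below is illustrative only; it is not MASN's actual code, and the feature shapes and attention form are assumptions.

# Hedged sketch: question-guided attention computed separately over per-frame
# features and per-clip features, keeping both levels for a later fusion step.
import torch
import torch.nn.functional as F

def multi_level_correlation(frame_feats, clip_feats, q_feat):
    # frame_feats: (B, T, D) per-frame features, clip_feats: (B, C, D) per-clip
    # features, q_feat: (B, D) pooled question feature.
    frame_scores = torch.einsum('bd,btd->bt', q_feat, frame_feats)  # frame level
    clip_scores = torch.einsum('bd,bcd->bc', q_feat, clip_feats)    # clip level

    frame_ctx = torch.einsum('bt,btd->bd', F.softmax(frame_scores, -1), frame_feats)
    clip_ctx = torch.einsum('bc,bcd->bd', F.softmax(clip_scores, -1), clip_feats)
    # both levels are returned so a fusion step can weight them by the question
    return frame_ctx, clip_ctx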
“…To capture the visual-language relation, some works utilize bilinear pooling operations or spatio-temporal attention mechanisms to align the video and textual features (Jang et al., 2019; Seo et al., 2021). Other methods use co-attention mechanisms (Jiang and Han, 2020; Li et al., 2021) to align multi-modal features, or employ memory-augmented RNNs (Yin et al., 2020) or graph memory mechanisms to perform relational reasoning in VideoQA.…”
Section: Introduction (mentioning)
confidence: 99%
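The co-attention alignment mentioned in the statement above can be illustrated with a generic bidirectional attention between video and question tokens. The sketch below is a textbook-style illustration under assumed shapes, not the cited authors' implementation.

# Hedged sketch of generic co-attention: an affinity matrix between video and
# question tokens is used to attend in both directions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, video, question):
        # video: (B, T, D) frame/clip features; question: (B, L, D) word features
        A = torch.bmm(self.affinity(video), question.transpose(1, 2))  # (B, T, L)

        # video attended by the question, and question attended by the video
        v2q = torch.bmm(F.softmax(A, dim=2), question)                 # (B, T, D)
        q2v = torch.bmm(F.softmax(A, dim=1).transpose(1, 2), video)    # (B, L, D)
        return v2q, q2v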