Proceedings of the 2021 International Conference on Multimedia Retrieval
DOI: 10.1145/3460426.3463635
Relation-aware Hierarchical Attention Framework for Video Question Answering

Abstract: Video Question Answering (VideoQA) is a challenging video understanding task since it requires a deep understanding of both the question and the video. Previous studies mainly focus on extracting sophisticated visual and language embeddings and fusing them with delicately hand-crafted networks. However, the relevance of different frames, objects, and modalities to the question varies over time, which most existing methods ignore. Lacking understanding of the dynamic relationships and interactions …
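To make the abstract's point about time-varying relevance concrete, below is a minimal sketch of question-guided temporal attention over frame features, in which the pooled video representation depends on how relevant each frame is to the question. This is an illustration only, not the paper's RHA model; the module names, dimensions, and scoring function are assumptions.

```python
# Minimal sketch (not the paper's implementation) of question-guided temporal
# attention: each frame's relevance to the question is scored separately, so
# the pooled video feature changes with the question. All names, shapes, and
# the additive scoring scheme are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedTemporalAttention(nn.Module):
    def __init__(self, frame_dim: int, question_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, frames: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # frames:   (batch, num_frames, frame_dim)
        # question: (batch, question_dim), e.g. a pooled sentence embedding
        q = self.question_proj(question).unsqueeze(1)          # (B, 1, H)
        f = self.frame_proj(frames)                            # (B, T, H)
        logits = self.score(torch.tanh(f + q)).squeeze(-1)     # (B, T)
        weights = F.softmax(logits, dim=-1)                    # per-frame relevance
        return torch.bmm(weights.unsqueeze(1), frames).squeeze(1)  # (B, frame_dim)

# Example: 8 frames of 2048-d features, a 768-d question embedding.
attn = QuestionGuidedTemporalAttention(frame_dim=2048, question_dim=768)
pooled = attn(torch.randn(2, 8, 2048), torch.randn(2, 768))    # (2, 2048)
```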


Cited by 9 publications (3 citation statements)
References 35 publications (59 reference statements)
“…MASN (Seo et al, 2021) introduces frame-level and clip-level modules to simultaneously model correlations at different levels between the visual information and the question. RHA (Li et al, 2021) proposes a hierarchical attention network to further model the video-subtitle-question correlation. There are also works that adopt memory-augmented approaches to capture this correlation (Fan et al, 2019; Yin et al, 2020).…”
Section: Related Work (mentioning)
Confidence: 99%
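As an illustration of the frame-level/clip-level idea attributed to MASN and the hierarchical attention attributed to RHA in the statement above, the sketch below pools frames within each clip and then pools clips over the video, conditioning both levels on the question. It is a hedged toy example, not either paper's actual code; the shapes and module names are assumptions.

```python
# Minimal two-level, question-conditioned attention sketch (assumptions only,
# not MASN's or RHA's actual code): frames are pooled within each clip, then
# clips are pooled over the video, with both levels scored against the question.
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(keys: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    # keys: (B, N, D), query: (B, D) -> question-weighted sum over the N items.
    logits = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)   # (B, N)
    weights = F.softmax(logits, dim=-1)
    return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)     # (B, D)

class HierarchicalVideoAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.frame_q = nn.Linear(dim, dim)  # question projection, frame level
        self.clip_q = nn.Linear(dim, dim)   # question projection, clip level

    def forward(self, frames: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # frames: (B, num_clips, frames_per_clip, D), question: (B, D)
        B, C, F_, D = frames.shape
        q_frame = self.frame_q(question)
        # Frame level: pool the frames inside each clip against the question.
        clip_feats = attend(frames.reshape(B * C, F_, D),
                            q_frame.repeat_interleave(C, dim=0)).reshape(B, C, D)
        # Clip level: pool the clips over the whole video against the question.
        return attend(clip_feats, self.clip_q(question))         # (B, D)

video = torch.randn(2, 4, 8, 512)      # 4 clips x 8 frames of 512-d features
question = torch.randn(2, 512)
pooled = HierarchicalVideoAttention(512)(video, question)        # (2, 512)
```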
“…To capture the visual-language relation, some works utilize bilinear pooling operations or spatial-temporal attention mechanisms to align the video and textual features (Jang et al, 2019; Seo et al, 2021). Other methods use the co-attention mechanism (Jiang and Han, 2020; Li et al, 2021) to align multi-modal features, or employ a memory-augmented RNN (Yin et al, 2020) or a graph memory mechanism to perform relational reasoning in VideoQA. Recently, DualVGR devised a graph-based reasoning unit and performed word-level attention to obtain question-related video features.…”
Section: Introduction (mentioning)
Confidence: 99%
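For the co-attention mechanism mentioned in the statement above, a minimal sketch is given below: an affinity matrix scores every question-word/video-frame pair, and each modality is then summarized under attention weights derived from the other. This follows the generic co-attention pattern rather than any cited paper's exact formulation; the names and dimensions are assumptions.

```python
# Minimal co-attention sketch for aligning question words with video frames
# (a generic illustration, not a cited paper's exact formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, words: torch.Tensor, frames: torch.Tensor):
        # words: (B, L, D) word features; frames: (B, T, D) frame features.
        # Affinity matrix: (B, L, T), one score per word-frame pair.
        A = torch.bmm(self.affinity(words), frames.transpose(1, 2))
        # Attend to frames from each word's view, and to words from each frame's view.
        frames_for_words = torch.bmm(F.softmax(A, dim=2), frames)                 # (B, L, D)
        words_for_frames = torch.bmm(F.softmax(A, dim=1).transpose(1, 2), words)  # (B, T, D)
        return frames_for_words, words_for_frames

coattn = CoAttention(512)
v_ctx, q_ctx = coattn(torch.randn(2, 12, 512), torch.randn(2, 20, 512))
```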
“…However, despite their effectiveness, pure textual information still cannot fully replicate a rich visual perceptual experience. To address this limitation, researchers have turned their attention to various vision-language tasks, such as visual question answering (Li et al; Zhang et al) [6,7], image and video caption generation (Chen F; Ghanimifard and Dobnik; Cornia et al) [8-10], and image-based question retrieval (Xin Yuan et al; Lu et al) [11,12]. In human conversational communication, images are crucial in compensating for information that cannot be accurately expressed through text alone.…”
Section: Introduction (mentioning)
Confidence: 99%