Proceedings of the 2021 International Conference on Multimedia Retrieval
DOI: 10.1145/3460426.3463635
Relation-aware Hierarchical Attention Framework for Video Question Answering

Abstract: Video Question Answering (VideoQA) is a challenging video understanding task since it requires a deep understanding of both the question and the video. Previous studies mainly focus on extracting sophisticated visual and language embeddings and fusing them with delicately hand-crafted networks. However, the relevance of different frames, objects, and modalities to the question varies over time, which most existing methods ignore. Lacking understanding of the dynamic relationships and interactions …
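To make the abstract's point about time-varying relevance concrete, below is a minimal sketch of question-guided temporal attention over frame features, in which the pooled video representation depends on how relevant each frame is to the question. This is an illustration only, not the paper's RHA model; the module names, dimensions, and scoring function are assumptions.

```python
# Minimal sketch (not the paper's implementation) of question-guided temporal
# attention: each frame's relevance to the question is scored separately, so
# the pooled video feature changes with the question. All names, shapes, and
# the additive scoring scheme are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedTemporalAttention(nn.Module):
    def __init__(self, frame_dim: int, question_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, frames: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # frames:   (batch, num_frames, frame_dim)
        # question: (batch, question_dim), e.g. a pooled sentence embedding
        q = self.question_proj(question).unsqueeze(1)          # (B, 1, H)
        f = self.frame_proj(frames)                            # (B, T, H)
        logits = self.score(torch.tanh(f + q)).squeeze(-1)     # (B, T)
        weights = F.softmax(logits, dim=-1)                    # per-frame relevance
        return torch.bmm(weights.unsqueeze(1), frames).squeeze(1)  # (B, frame_dim)

# Example: 8 frames of 2048-d features, a 768-d question embedding.
attn = QuestionGuidedTemporalAttention(frame_dim=2048, question_dim=768)
pooled = attn(torch.randn(2, 8, 2048), torch.randn(2, 768))    # (2, 2048)
```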


Cited by 9 publications (3 citation statements)
References 35 publications (59 reference statements)
“…MASN (Seo et al, 2021) introduces frame-level and clip-level modules to simultaneously model correlations at different levels between the visual information and the question. RHA (Li et al, 2021) proposes a hierarchical attention network to further model the video-subtitle-question correlation. There are also works that adopt memory-augmented approaches to capture this correlation (Fan et al, 2019; Yin et al, 2020).…”
Section: Related Work (mentioning)
Confidence: 99%
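As an illustration of the frame-level/clip-level idea attributed to MASN and the hierarchical attention attributed to RHA in the statement above, the sketch below pools frames within each clip and then pools clips over the video, conditioning both levels on the question. It is a hedged toy example, not either paper's actual code; the shapes and module names are assumptions.

```python
# Minimal two-level, question-conditioned attention sketch (assumptions only,
# not MASN's or RHA's actual code): frames are pooled within each clip, then
# clips are pooled over the video, with both levels scored against the question.
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(keys: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    # keys: (B, N, D), query: (B, D) -> question-weighted sum over the N items.
    logits = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)   # (B, N)
    weights = F.softmax(logits, dim=-1)
    return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)     # (B, D)

class HierarchicalVideoAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.frame_q = nn.Linear(dim, dim)  # question projection, frame level
        self.clip_q = nn.Linear(dim, dim)   # question projection, clip level

    def forward(self, frames: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # frames: (B, num_clips, frames_per_clip, D), question: (B, D)
        B, C, F_, D = frames.shape
        q_frame = self.frame_q(question)
        # Frame level: pool the frames inside each clip against the question.
        clip_feats = attend(frames.reshape(B * C, F_, D),
                            q_frame.repeat_interleave(C, dim=0)).reshape(B, C, D)
        # Clip level: pool the clips over the whole video against the question.
        return attend(clip_feats, self.clip_q(question))         # (B, D)

video = torch.randn(2, 4, 8, 512)      # 4 clips x 8 frames of 512-d features
question = torch.randn(2, 512)
pooled = HierarchicalVideoAttention(512)(video, question)        # (2, 512)
```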
“…To capture the visual-language relation, some works utilize bilinear pooling operations or spatial-temporal attention mechanisms to align the video and textual features (Jang et al, 2019; Seo et al, 2021). Other methods use the co-attention mechanism (Jiang and Han, 2020; Li et al, 2021) to align multi-modal features, or employ a memory-augmented RNN (Yin et al, 2020) or a graph memory mechanism to perform relational reasoning in VideoQA. Recently, DualVGR devised a graph-based reasoning unit and performed word-level attention to obtain question-related video features.…”
Section: Introduction (mentioning)
Confidence: 99%
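For the co-attention mechanism mentioned in the statement above, a minimal sketch is given below: an affinity matrix scores every question-word/video-frame pair, and each modality is then summarized under attention weights derived from the other. This follows the generic co-attention pattern rather than any cited paper's exact formulation; the names and dimensions are assumptions.

```python
# Minimal co-attention sketch for aligning question words with video frames
# (a generic illustration, not a cited paper's exact formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, words: torch.Tensor, frames: torch.Tensor):
        # words: (B, L, D) word features; frames: (B, T, D) frame features.
        # Affinity matrix: (B, L, T), one score per word-frame pair.
        A = torch.bmm(self.affinity(words), frames.transpose(1, 2))
        # Attend to frames from each word's view, and to words from each frame's view.
        frames_for_words = torch.bmm(F.softmax(A, dim=2), frames)                 # (B, L, D)
        words_for_frames = torch.bmm(F.softmax(A, dim=1).transpose(1, 2), words)  # (B, T, D)
        return frames_for_words, words_for_frames

coattn = CoAttention(512)
v_ctx, q_ctx = coattn(torch.randn(2, 12, 512), torch.randn(2, 20, 512))
```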
“…However, despite their effectiveness, pure textual information still cannot fully replicate a rich visual perceptual experience. To address this limitation, researchers have turned their attention to various vision-language tasks, such as visual question answering (Li et al; Zhang et al) [6,7], image and video caption generation (Chen F; Ghanimifard and Dobnik; Cornia et al) [8-10], and image-based question retrieval (Xin Yuan et al; Lu et al) [11,12]. In human conversational communication, images are crucial in compensating for information that cannot be accurately expressed through text alone.…”
Section: Introduction (mentioning)
Confidence: 99%