Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

Seo, Ahjeong; Kang, Gi-Cheon; Park, Joonhan; Zhang, Byoung-Tak

doi:10.48550/arxiv.2106.10446

Cited by 5 publications

(5 citation statements)

References 41 publications

(38 reference statements)


“…We verify the proposed model on three well-known datasets, MSRVTT-QA (Xu et al, 2017a), MSRVTT multi-choice (Yu et al, 2018a), and TGIF-QA (Jang et al, 2017), widely used in recent video QA works (Jang et al, 2017; Gao et al, 2018; Li et al, 2019; Fan et al, 2019; Le et al, 2020; Zhu and Yang, 2020; Lei et al, 2021; Seo et al, 2021). Experiments show that our model achieves dramatic improvement over the powerful state-of-the-art model ClipBERT (Lei et al, 2021), with an average accuracy increment of more than 3 percentage points.…”

Section: Introduction (mentioning)

confidence: 89%

“…Existing methods for video QA conduct direct answering selection based on the multimodal encoding of questions and videos (Jang et al, 2017; Lei et al, 2018, 2020). In recent years, researchers have proposed many optimization strategies for better performance in video question answering, e.g., designing delicate encoding mechanisms (Kim et al, 2020a; Nuamah, 2021; Gao et al, 2018; Li et al, 2019; Fan et al, 2019; Le et al, 2020; Jiang et al, 2020; Kim et al, 2020b; Seo et al, 2021) graphs, adopting video pre-trained language models (Li et al, 2020; Zellers et al, 2021; Li and Wang, 2020; Lei et al, 2021; Sun et al, 2019), and leveraging external knowledge or resources (Chadha et al, 2020; Liu et al, 2020b; Song et al, 2021). Compared with conventional monomodal question answering tasks such as text QA (Oguz et al, 2021; Zhou et al, 2018) and table QA.…”

Section: Introduction (mentioning)

confidence: 99%


Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering

Mao, Jiang, Wang et al. 2022

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua


Existing video question answering (video QA) models lack the capacity for deep video understanding and flexible multistep reasoning. We propose for video QA a novel model which performs dynamic multistep reasoning between questions and videos. It creates video semantic representation based on the video scene graph composed of semantic elements of the video and semantic relations among these elements. Then, it performs multistep reasoning for better answer decision between the representations of the question and the video, and dynamically integrates the reasoning results. Experiments show the significant advantage of the proposed model against previous methods in accuracy and interpretability. Against the existing state-of-the-art model, the proposed model dramatically improves more than 4%/3.1%/2% on the three widely used video QA datasets, MSRVTT-QA, MSRVTT multi-choice, and TGIF-QA, and displays better interpretability by backtracing along with the attention mechanisms to the video scene graphs.


“…Since we are dealing with spatial and temporal dependencies graphs can help establish these dependencies very well and the work by (Seo et al 2021) presents the same. Object graphs are constructed via graph convolutional networks (GCN) to compute the relationships among objects in each visual feature.…”

Section: Graph Based Techniques (mentioning)

confidence: 94%
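To make the quoted description of object graphs concrete, here is a minimal sketch of a single graph-convolution step over per-object features. It assumes the common symmetric normalization with self-loops and a similarity-based adjacency; the function name `gcn_layer`, the feature sizes, and the way the adjacency is built are hypothetical choices for illustration and are not taken from Seo et al. or from the citing paper.

```python
import numpy as np


def gcn_layer(features, adjacency, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    Illustrative only; the normalization is an assumption, not the
    formulation used in the cited work.
    """
    n = adjacency.shape[0]
    a_hat = adjacency + np.eye(n)              # add self-loops
    deg = a_hat.sum(axis=1)                    # >= 1 because of self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    norm_adj = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ features @ weight, 0.0)


rng = np.random.default_rng(0)
num_objects, feat_dim, out_dim = 4, 8, 5       # hypothetical sizes
object_features = rng.normal(size=(num_objects, feat_dim))

# Hypothetical adjacency: non-negative cosine similarity between objects.
unit = object_features / np.linalg.norm(object_features, axis=1, keepdims=True)
adjacency = np.clip(unit @ unit.T, 0.0, None)
np.fill_diagonal(adjacency, 0.0)

weight = rng.normal(size=(feat_dim, out_dim))
updated = gcn_layer(object_features, adjacency, weight)
```

Each row of `updated` mixes an object's own features with those of the objects it is connected to, which is the sense in which a GCN "computes relationships among objects".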

Weight-based multi-stream model for Multi-Modal Video Question Answering

Rajesh, Sridhar, Kulkarni et al. 2023

FLAIRS


There has been a tremendous success in individual domains of Computer Vision, Natural Language Processing, and Knowledge Representation. Videos are a rich source of information with the multi-modal data forms of images, audio, and optionally subtitles blended. Current research is going on in combining these individual domains which have given rise to topics such as image captioning, visual question answering, and video question answering. Video Question Answering is a model which combines research topics like object detection and recognition, temporal information processing, visual attention, and natural language processing. In this paper, we propose a model with Attention Mechanism for Video Question Answering that assigns varying weights to the many pieces of information the video encompasses. The model combines the question with 3 streams i.e., video's frames, subtitles, and objects to get the most probable answer. The model also receives the set of answer candidates as input and predicts one of them as the most probable answer since it has been trained and tested on the TVQA dataset.


“…Seeing is Knowing (106), MULAN (107) Faster R-CNN with ResNet-101 GAT (108), ATH (109), DMMGR (24), MCLN (110), MCAN (111), F-SWAP (112), SRRN (35), TVQA (113) Faster R-CNN with Resnet-152 RA-MAP (114), MASN (115), Anamoly based (114), Vocab based (116), DA-Net (117) ResNet CNN within Faster R-CNN MuVAM (118) Faster R-CNN with ResNext-152 CBM (119) RCNN (120) Multi-image (89) VGGNet (121) VQA-AID (122) EfficientNetV2 (123) RealFormer (124) YOLO (125) Scene Text VQA (126) CLIPViT-B CCVQA (14) Resnet NFNet (127) Flamingo (128) ViT (129) VLMmed (46), ConvS2S+ViT (130), BMT (10), M2I2 (52) XCLIP with ViT-L/14 CMQR (32) RsNet18, Swin, ViT LV-GPT (43) GLIP (131) REVIVE (132) CLIP (133) KVQAE (30) 2.6.4 VGGNet (121) VGGNet (Visual Geometry Group Network) is a CNN with a small number of layers, achieving good performance in image classification tasks. It is basically known for its simplicity and generalizability to new datasets.…”

Section: Faster Rcnn (mentioning)

confidence: 99%

A Review of Recent Advances in Visual Question Answering: Capsule Networks and Vision Transformers in Focus

Prakash, Devananda 2024

IJST


Objectives: Multimodal deep learning, incorporating images, text, videos, speech, and acoustic signals, has grown significantly. This article aims to explore the untapped possibilities of multimodal deep learning in Visual Question Answering (VQA) and address a research gap in the development of effective techniques for comprehensive image feature extraction. Methods: This article provides a comprehensive overview of VQA and the associated challenges. It emphasizes the need for an extensive representation of images in VQA and pinpoints the specific research gap pertaining to image feature extraction and highlights the fundamental concepts of VQA, the challenges faced, different approaches and applications used for VQA tasks. A substantial portion of this review is devoted to investigating recent advancements in image feature extraction techniques. Findings: Most existing VQA research predominantly emphasizes the accurate matching of answers to given questions, often overlooking the necessity for a comprehensive representation of images. These models primarily rely on question content analysis while underemphasizing image understanding or sometimes neglect image examination entirely. There is also a tendency in multimodal systems to neglect or overemphasize one modality, notably the visual one, which challenges genuine multimodal integration. This article reveals that there is limited benchmarking for image feature extraction techniques. Evaluating the quality of extracted image features is crucial for VQA tasks. Novelty: While many VQA studies have primarily concentrated on the accuracy of answers to questions, this review emphasizes the importance of comprehensive image representation. 
The paper explores recent advances in Capsules Networks (CapsNets) and Vision Transformers (ViTs) as alternatives to traditional Convolutional Neural Networks (CNNs), for development of more effective image feature extraction techniques which can help to address the limitations of existing VQA models that focus primarily on question content analysis. https://www.indjst.org/
