Multimodal Cross-guided Attention Networks for Visual Question Answering

Liu, Haibin; Gong, Shengrong; Yang, Jianyu; Xing, Tengfei; Liu, Chunping

doi:10.2991/cmsa-18.2018.80

Cited by 3 publications

(1 citation statement)

References 15 publications

“…Seeing is Knowing (106), MULAN (107) Faster R-CNN with ResNet-101 GAT (108), ATH (109), DMMGR (24), MCLN (110), MCAN (111), F-SWAP (112), SRRN (35), TVQA (113) Faster R-CNN with Resnet-152 RA-MAP (114), MASN (115), Anamoly based (114), Vocab based (116), DA-Net (117) ResNet CNN within Faster R-CNN MuVAM (118) Faster R-CNN with ResNext-152 CBM (119) RCNN (120) Multi-image (89) VGGNet (121) VQA-AID (122) EfficientNetV2 (123) RealFormer (124) YOLO (125) Scene Text VQA (126) CLIPViT-B CCVQA (14) Resnet NFNet (127) Flamingo (128) ViT (129) VLMmed (46), ConvS2S+ViT (130), BMT (10), M2I2 (52) XCLIP with ViT-L/14 CMQR (32) RsNet18, Swin, ViT LV-GPT (43) GLIP (131) REVIVE (132) CLIP (133) KVQAE (30) 2.6.4 VGGNet (121) VGGNet (Visual Geometry Group Network) is a CNN with a small number of layers, achieving good performance in image classification tasks. It is basically known for its simplicity and generalizability to new datasets.…”

Section: Faster R-CNN (mentioning)

confidence: 99%

A Review of Recent Advances in Visual Question Answering: Capsule Networks and Vision Transformers in Focus

Prakash, Devananda

2024

IJST

Objectives: Multimodal deep learning, which incorporates images, text, video, speech, and acoustic signals, has grown significantly. This article aims to explore the untapped possibilities of multimodal deep learning in Visual Question Answering (VQA) and to address a research gap in the development of effective techniques for comprehensive image feature extraction.

Methods: This article provides a comprehensive overview of VQA and its associated challenges. It emphasizes the need for an extensive representation of images in VQA, pinpoints the specific research gap in image feature extraction, and outlines the fundamental concepts of VQA, the challenges faced, and the approaches and applications used for VQA tasks. A substantial portion of the review investigates recent advances in image feature extraction techniques.

Findings: Most existing VQA research emphasizes accurately matching answers to given questions, often overlooking the need for a comprehensive representation of images. These models rely primarily on question content analysis while underemphasizing image understanding, and sometimes neglect image examination entirely. Multimodal systems also tend to neglect or overemphasize one modality, notably the visual one, which hinders genuine multimodal integration. The article finds that benchmarking of image feature extraction techniques is limited, although evaluating the quality of extracted image features is crucial for VQA tasks.

Novelty: While many VQA studies have concentrated on answer accuracy, this review emphasizes the importance of comprehensive image representation. The paper explores recent advances in Capsule Networks (CapsNets) and Vision Transformers (ViTs) as alternatives to traditional Convolutional Neural Networks (CNNs) for developing more effective image feature extraction techniques, which can help address the limitations of existing VQA models that focus primarily on question content analysis. https://www.indjst.org/
