2018
DOI: 10.2991/cmsa-18.2018.80

Multimodal Cross-guided Attention Networks for Visual Question Answering

Abstract: Visual Question Answering (VQA) is an attractive topic combining computer vision with natural language processing. It is more challenging than text-based question answering because of its multimodal nature. The VQA reasoning process requires both effective semantic embedding and fine-grained visual comprehension. Existing approaches predominantly infer answers from visual spatial information, while neglecting important semantic information in questions and the guidance information between images and q…

Cited by 3 publications
(1 citation statement)
References 15 publications
“…Seeing is Knowing (106) , MULAN (107) Faster R-CNN with ResNet-101 GAT (108) , ATH (109) , DMMGR (24) , MCLN (110) , MCAN (111) , F-SWAP (112) , SRRN (35) , TVQA (113) Faster R-CNN with Resnet-152 RA-MAP (114) , MASN (115) , Anamoly based (114) , Vocab based (116) , DA-Net (117) ResNet CNN within Faster R-CNN MuVAM (118) FasterR-CNN with ResNext-152 CBM (119) RCNN (120) Multi-image (89) VGGNet (121) VQA-AID (122) EfficientNetV2 (123) RealFormer (124) YOLO (125) Scene Text VQA (126) CLIPViT-B CCVQA (14) Resnet NFNet (127) Flamingo (128) ViT (129) VLMmed (46) , ConvS2S+ViT (130) , BMT (10) , M2I2 (52) XCLIP with ViT-L/14 CMQR (32) RsNet18, Swin, ViT LV-GPT (43) GLIP (131) REVIVE (132) CLIP (133) KVQAE (30) 2.6.4 VGGNet (121) VGGNet (Visual Geometry Group Network) is a CNN with a small number of layers, achieving good performance in image classification tasks. It is basically known for its simplicity and generalizability to new datasets.…”
Section: Faster Rcnn
Citation type: mentioning
confidence: 99%