2021 IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip42928.2021.9506796
Vision And Text Transformer For Predicting Answerability On Visual Question Answering

Abstract: Answerability on Visual Question Answering is a novel and attractive task: predicting answerability scores between images and questions in multi-modal data. Existing works often map the outputs of visual question answering systems into a binary answerability label, which does not reflect the essence of the problem. Treating Answerability instead as a regression task, we propose VT-Transformer, which exploits visual and textual features through a Transformer architecture. Experimental results on VizWiz 20…
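The abstract frames answerability as a regression over fused visual and textual features rather than a binary classification. A minimal sketch of that regression formulation (with hypothetical feature dimensions and a simple concatenation-plus-linear head, not the paper's actual VT-Transformer architecture):

```python
import numpy as np

def answerability_score(visual_feat, text_feat, w, b):
    """Regress an answerability score in (0, 1) from fused features.

    visual_feat, text_feat: 1-D feature vectors (e.g. encoder outputs).
    w, b: parameters of a hypothetical linear regression head.
    """
    fused = np.concatenate([visual_feat, text_feat])  # simple concat fusion
    logit = fused @ w + b
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid keeps the score continuous in (0, 1)

# Toy example with random features; a real system would use image/text encoders.
rng = np.random.default_rng(0)
v = rng.standard_normal(4)
t = rng.standard_normal(4)
w = rng.standard_normal(8)
score = answerability_score(v, t, w, 0.0)
print(score)  # a continuous score, not a 0/1 label
```

The point of the regression framing is that the model outputs a graded score, so near-unanswerable pairs are distinguishable from clearly unanswerable ones, which a binary mapping discards.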

Cited by 5 publications (1 citation statement) · References 10 publications (16 reference statements)
“…It creates a CDVQA dataset and devises a baseline CDVQA framework, exploring different backbones and fusion strategies. Reference (50) presents VT-Transformer, an approach to Answerability on VQA that achieves competitive results on the VizWiz 2020 dataset. Reference (51) describes the AliceMind-MMU system, which achieves human-level performance on VQA by pre-training with comprehensive visual and textual feature representations and using specialized expert modules for different types of visual questions.…”
Section: Vision Transformers for Visual Question Answering
confidence: 99%