2021 IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip42928.2021.9506796
Vision And Text Transformer For Predicting Answerability On Visual Question Answering

Abstract: Answerability on Visual Question Answering is a novel and attractive task: predicting answerability scores between images and questions in multi-modal data. Existing works often map the outputs of visual question answering systems into a binary answerability label, which does not reflect the essence of the problem. Treating Answerability instead as a regression task, we propose VT-Transformer, which exploits visual and textual features through a Transformer architecture. Experimental results on VizWiz 20…
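The abstract frames answerability as a regression over fused visual and textual features rather than a binary classification. A minimal sketch of that regression formulation (with hypothetical feature dimensions and a simple concatenation-plus-linear head, not the paper's actual VT-Transformer architecture):

```python
import numpy as np

def answerability_score(visual_feat, text_feat, w, b):
    """Regress an answerability score in (0, 1) from fused features.

    visual_feat, text_feat: 1-D feature vectors (e.g. encoder outputs).
    w, b: parameters of a hypothetical linear regression head.
    """
    fused = np.concatenate([visual_feat, text_feat])  # simple concat fusion
    logit = fused @ w + b
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid keeps the score continuous in (0, 1)

# Toy example with random features; a real system would use image/text encoders.
rng = np.random.default_rng(0)
v = rng.standard_normal(4)
t = rng.standard_normal(4)
w = rng.standard_normal(8)
score = answerability_score(v, t, w, 0.0)
print(score)  # a continuous score, not a 0/1 label
```

The point of the regression framing is that the model outputs a graded score, so near-unanswerable pairs are distinguishable from clearly unanswerable ones, which a binary mapping discards.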

Cited by 5 publications (1 citation statement) · References 10 publications (16 reference statements)
“…It creates a CDVQA dataset and devises a baseline CDVQA framework, exploring different backbones and fusion strategies. Reference (50) presents VT-Transformer, an approach to Answerability on VQA that achieves competitive results on the VizWiz 2020 dataset. Reference (51) describes the AliceMind-MMU system, which achieves human-level performance on VQA by pre-training with comprehensive visual and textual feature representations and using specialized expert modules for different types of visual questions.…”
Section: Vision Transformers for Visual Question Answering
confidence: 99%