2019 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv.2019.00043

Interpretable Visual Question Answering by Visual Grounding From Attention Supervision Mining

Abstract: A key aspect of VQA models that are interpretable is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically …

Cited by 63 publications (55 citation statements)
References 24 publications
“…Visual-Explainable Ability in VQA Models. To improve visual-explainable ability, early works [33], [34], [35] directly apply human attention as supervision to guide the models' attention maps. However, due to the existence of strong biases, even with appropriate attention maps, the remaining layers of the network may still disregard the visual signal [36].…”
Section: Related Work
confidence: 99%
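As a rough illustration of what "directly apply human attention as supervision" typically looks like, the following is a minimal PyTorch-style sketch. The function name, tensor shapes, and the choice of a KL-divergence matching term are assumptions for illustration only, not the implementation of the cited works.

import torch
import torch.nn.functional as F

def attention_supervision_loss(model_attn_logits, human_attn):
    # model_attn_logits: (batch, K) unnormalized attention scores over K image regions
    # human_attn:        (batch, K) non-negative human attention weights over the same regions
    log_p_model = F.log_softmax(model_attn_logits, dim=-1)       # model attention as a log-distribution
    p_human = human_attn / human_attn.sum(dim=-1, keepdim=True)  # normalize the human map to a distribution
    # Penalize divergence between the human attention map and the model's attention
    return F.kl_div(log_p_model, p_human, reduction="batchmean")

# Hypothetical usage: add the grounding term to the usual answer loss
# total_loss = F.cross_entropy(answer_logits, answers) \
#              + attn_weight * attention_supervision_loss(attn_logits, human_maps)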
“…After softmax, each sentence consequently has a context-specific distribution for each modality, ET and EEG, reflecting the averaged responses of the human subjects. Following numerous other studies that have performed explicit attention supervision [34][35][36][37], we compute two attention losses as the Kullback-Leibler divergence (D_KL) from the aggregate model attention weights α to the ET and EEG distributions α′′_ET and α′_EEG. We do so for each sentence j in batches of size M for each modality to obtain the eye-tracking loss L_ET and the EEG loss L_EEG.…”
Section: Methods
confidence: 99%
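A hedged sketch of the loss described in this excerpt follows. The direction of the KL terms (read here as D_KL(α || α_ET) and D_KL(α || α_EEG)), the tensor shapes, and names such as model_attn, et_attn, and eeg_attn are assumptions based only on the wording above, not the cited paper's code.

import torch.nn.functional as F

def batched_kl(p, q, eps=1e-8):
    # D_KL(p || q) for row-wise distributions of shape (M, T), averaged over the M rows.
    # F.kl_div expects the log of its first argument, so kl_div(log q, p) = KL(p || q).
    return F.kl_div((q + eps).log(), p, reduction="batchmean")

def attention_losses(model_attn, et_attn, eeg_attn):
    # model_attn: (M, T) aggregate model attention per sentence (rows sum to 1 after softmax)
    # et_attn, eeg_attn: (M, T) eye-tracking and EEG attention distributions per sentence
    loss_et = batched_kl(model_attn, et_attn)    # L_ET
    loss_eeg = batched_kl(model_attn, eeg_attn)  # L_EEG
    return loss_et, loss_eeg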
“…As fine-grained grounding becomes a potential incentive for next-generation vision-language systems, to what degree it can benefit remains an open question. On the one hand, for VQA [4,40] the authors point out that the attention model does not attend to the same regions as humans, and adding attention supervision barely helps performance. On the other hand, adding supervision to feature-map attention [15,38] was found to be beneficial.…”
Section: Related Work
confidence: 99%