Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1349
Multi-grained Attention with Object-level Grounding for Visual Question Answering

Abstract: Attention mechanisms are widely used in Visual Question Answering (VQA) to search for visual clues related to the question. Most approaches train attention models from a coarse-grained association between sentences and images, which tends to fail on small objects or uncommon concepts. To address this problem, this paper proposes a multi-grained attention method. It learns explicit word-object correspondence by two types of word-level attention complementary to the sentence-image association. Evaluated on the VQA b…
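The core idea in the abstract, word-level attention over detected object features as a complement to coarse sentence-image attention, can be illustrated with a short sketch. The code below is not the authors' implementation; it is a minimal PyTorch illustration under assumed names (WordObjectAttention, word_feats, obj_feats) and feature dimensions, showing how each question word can attend over object-level region features to produce a per-word grounded representation.

# Minimal sketch (not the paper's code): word-level attention over object features.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordObjectAttention(nn.Module):
    def __init__(self, word_dim: int, obj_dim: int, hidden_dim: int):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, hidden_dim)  # project word embeddings
        self.obj_proj = nn.Linear(obj_dim, hidden_dim)    # project object (region) features

    def forward(self, word_feats: torch.Tensor, obj_feats: torch.Tensor):
        # word_feats: (batch, n_words, word_dim), e.g. question token embeddings
        # obj_feats:  (batch, n_objects, obj_dim), e.g. detector region features
        q = self.word_proj(word_feats)                    # (B, W, H)
        k = self.obj_proj(obj_feats)                      # (B, O, H)
        scores = torch.bmm(q, k.transpose(1, 2))          # (B, W, O) word-object affinities
        attn = F.softmax(scores, dim=-1)                  # each word attends over the objects
        grounded = torch.bmm(attn, obj_feats)             # (B, W, obj_dim) per-word visual context
        return grounded, attn

# Toy usage: one 6-word question, 36 detected objects.
if __name__ == "__main__":
    model = WordObjectAttention(word_dim=300, obj_dim=2048, hidden_dim=512)
    words = torch.randn(1, 6, 300)
    objects = torch.randn(1, 36, 2048)
    grounded, attn = model(words, objects)
    print(grounded.shape, attn.shape)  # (1, 6, 2048), (1, 6, 36)

In this sketch the per-word attention maps could be supervised with explicit word-object grounding labels, while a separate sentence-level attention handles the coarse question-image association described in the abstract.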

Cited by 22 publications (11 citation statements)
References 18 publications
“…Visual concept learning. Learning visual concepts from language and other forms of supervision provides useful representations for various downstream tasks, such as image captioning (Yin and Ordonez, 2017; Wang et al., 2018), visual-question answering (Yi et al., 2018; Huang et al., 2019), shape differentiation (Achlioptas et al., 2019), image classification (Mu et al., 2020), and scene manipulation (Prabhudesai et al., 2020). Previous work has been focusing on various types of representations (Ren et al., 2016; Wu et al., 2017), training algorithms (Faghri et al., 2018; Morgado et al., 2020) and supervision (Johnson et al., 2016; Yang et al., 2018).…”
Section: Related Work (mentioning)
confidence: 99%
“…Visual concept learning. Learning visual concepts from language and other forms of supervision provides useful representations for various downstream tasks, such as image captioning (Yin and Ordonez, 2017; Wang et al., 2018), visual-question answering (Yi et al., 2018; Huang et al., 2019), and scene manipulation (Prabhudesai et al., 2020). Previous work has been focusing on various types of representations (Ren et al., 2016; Wu et al., 2017), training algorithms (Faghri et al., 2018) and supervision (Johnson et al., 2016; Yang et al., 2018).…”
Section: Related Work (mentioning)
confidence: 99%
“…Because many of these systems are designed to support voice-based dialog, they overlook non-textual forms of interaction used in social media conversations. In parallel, multimodal NLP systems have been developed for image data, often focusing on image-to-text tasks such as image captioning (Melas-Kyriazi et al., 2018; Sharma et al., 2018) and visual question answering (Antol et al., 2015; Huang et al., 2019; Khademi, 2020). More recent work has focused on the reverse text-to-image dimension, such as generating an image from a description (Niu et al., 2020; Ramesh et al., 2021).…”
Section: Introduction (mentioning)
confidence: 99%