2020
DOI: 10.1109/tkde.2020.2998805

Answer Again: Improving VQA with Cascaded-Answering Model

Cited by 18 publications (8 citation statements). References 50 publications.
“…Further, H-CFIM gives lower accuracy compared to V-CFIM as it blends information from different attention paths but avoids the clash between the top-down and bottom-up paths. In Table 5, we have compared the performance of the proposed VQA method with 22 state-of-the-art methods, for example, Re-attention [23], ALSA [50], IASSM [51], MRA-Net [35] and CAM [52], on both test-dev and test-std sets.…”
Section: Based on the Combination of TCAM and CFIM (mentioning)
confidence: 99%
“…Recently, multi-modal analysis has attracted a lot of attention with the rapid growth of multi-media data. Different kinds of information they contain are complementary and help achieve comprehensive results [12,30,31,43,53]. So it is significant to learn multi-modal representation for boosting the single-modal tasks.…”
Section: Multi-modal Analysis (mentioning)
confidence: 99%
“…In order to obtain more expressive image and question features, most existing models [1] highlight important words in the question and the image regions associated with the question using attention mechanisms. However, these existing methods only consider object attention, which may be sufficient to answer some simple questions, like the one in figure ??.…”
mentioning
confidence: 99%
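
The statement above refers to attention mechanisms that weight question words and question-relevant image regions. As a rough illustration only (not the cited paper's model), below is a minimal sketch of single-glimpse, question-guided soft attention over detected region features in PyTorch; the class name, feature dimensions, and the single-vector question encoding are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Minimal sketch of object-level (top-down) attention for VQA.

    Assumptions (not from the cited paper): region features come from an
    off-the-shelf detector (e.g. 36 regions x 2048-d), the question is
    already encoded into a single vector, and a single glimpse is used.
    """

    def __init__(self, region_dim=2048, question_dim=1024, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)    # project region features
        self.proj_q = nn.Linear(question_dim, hidden_dim)  # project question vector
        self.score = nn.Linear(hidden_dim, 1)              # scalar score per region

    def forward(self, regions, question):
        # regions:  (batch, num_regions, region_dim)
        # question: (batch, question_dim)
        q = self.proj_q(question).unsqueeze(1)             # (batch, 1, hidden)
        joint = torch.tanh(self.proj_v(regions) + q)       # fuse question with each region
        weights = F.softmax(self.score(joint), dim=1)      # attention weights over regions
        attended = (weights * regions).sum(dim=1)          # question-weighted region summary
        return attended, weights.squeeze(-1)

# Usage: attend to 36 region features with a 1024-d question encoding.
attn = QuestionGuidedAttention()
v = torch.randn(2, 36, 2048)
q = torch.randn(2, 1024)
v_att, w = attn(v, q)
print(v_att.shape, w.shape)  # torch.Size([2, 2048]) torch.Size([2, 36])
```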