2015
DOI: 10.48550/arxiv.1511.03416
Preprint

Visual7W: Grounded Question Answering in Images

Cited by 24 publications (19 citation statements)
References 37 publications
“…Recently, there has been momentous success in using CNN [5] features along with Recurrent Neural Networks [6][7][8][9][10][11] (RNNs) to represent those temporal dynamics in data [12][13][14][15][16][17][18][19]. We aim to extend that idea to modeling the dynamics in storylines.…”
Section: Introduction
confidence: 99%
“…The results are shown in Table 4. It can be seen that our full model outperforms the baseline and the truncated model with an external parser, and achieves much higher accuracy than previous work [32]. Figure 6 shows some question answering examples on this dataset.…”
Section: Answering Pointing Questions in Visual-7W
confidence: 80%
“…Next we apply our method to real images and expressions in the Visual Genome dataset [14] and Google-Ref dataset [20]. Since the task of answering pointing questions in visual question answering is similar to grounding referential expressions, we also evaluate our model on the pointing questions in the Visual-7W dataset [32].…”
Section: Methods
confidence: 99%
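The analogy this excerpt draws between grounding referential expressions and answering pointing questions can be sketched as a region-scoring problem: encode the question, score each candidate box's features against it, and return the best-scoring box. The bilinear scorer and all dimensions below are illustrative assumptions, not the model from [32] or the cited grounding papers.

```python
# Minimal sketch (assumed scorer, not the cited papers' models) of
# answering a pointing question by ranking candidate region features
# against a question embedding and selecting the best match.
import torch
import torch.nn as nn

class PointingScorer(nn.Module):
    def __init__(self, region_dim=512, question_dim=300):
        super().__init__()
        # Bilinear compatibility between the question and each region.
        self.score = nn.Bilinear(question_dim, region_dim, 1)

    def forward(self, question_vec, region_feats):
        # question_vec: (question_dim,); region_feats: (num_regions, region_dim)
        q = question_vec.expand(region_feats.size(0), -1)
        return self.score(q, region_feats).squeeze(-1)  # (num_regions,)

# Usage: pick one of 4 candidate boxes for a question embedding.
scorer = PointingScorer()
scores = scorer(torch.randn(300), torch.randn(4, 512))
print(int(scores.argmax()))  # index of the selected region
```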
“…Boosted by the development of Deep Learning, letting the computer understand an image seems increasingly within reach. With research on object detection gradually maturing [37,36,32,24,22,23], more and more researchers are turning their attention to higher-level understanding of the scene [21,51,2,48,49,46,9,6,7,47]. As an intermediate-level task connecting image captioning and object detection, visual relationship/phrase detection is gaining more attention in scene understanding [33,41,3].…”
Section: Introduction
confidence: 99%