Visual7W: Grounded Question Answering in Images

Zhu, Yuke; Groth, Oliver; Bernstein, Michael S.; Li, Feifei

doi:10.48550/arxiv.1511.03416

Cited by 24 publications

(19 citation statements)

References 37 publications


“…Recently, there has been momentous success in using CNN [5] features along with Recurrent Neural Networks [6][7][8][9][10][11] (RNNs) to represent those temporal dynamics in data [12][13][14][15][16][17][18][19]. We aim to extend that idea to modeling the dynamics in storylines.…”

Section: Introduction (mentioning)

confidence: 99%

Learning Visual Storylines with Skipping Recurrent Neural Networks

Sigurdsson; Chen; Gupta. 2016. Computer Vision – ECCV 2016


What does a typical visit to Paris look like? Do people first take photos of the Louvre and then the Eiffel Tower? Can we visually model a temporal event like "Paris Vacation" using current frameworks? In this paper, we explore how we can automatically learn the temporal aspects, or storylines of visual concepts from web data. Previous attempts focus on consecutive image-to-image transitions and are unsuccessful at recovering the long-term underlying story. Our novel Skipping Recurrent Neural Network (S-RNN) model does not attempt to predict each and every data point in the sequence, like classic RNNs. Rather, S-RNN uses a framework that skips through the images in the photo stream to explore the space of all ordered subsets of the albums via an efficient sampling procedure. This approach reduces the negative impact of strong short-term correlations, and recovers the latent story more accurately. We show how our learned storylines can be used to analyze, predict, and summarize photo albums from Flickr. Our experimental results provide strong qualitative and quantitative evidence that S-RNN is significantly better than other candidate methods such as LSTMs on learning long-term correlations and recovering latent storylines. Moreover, we show how storylines can help machines better understand and summarize photo streams by inferring a brief personalized story of each individual album.



“…The results are shown in Table 4. It can be seen that our full model outperforms the baseline and the truncated model with an external parser, and achieves much higher accuracy than previous work [32]. Figure 6 shows some question answering examples on this dataset.…”

Section: Answering Pointing Questions in Visual-7W (mentioning)

confidence: 80%

“…Next we apply our method to real images and expressions in the Visual Genome dataset [14] and Google-Ref dataset [20]. Since the task of answering pointing questions in visual question answering is similar to grounding referential expressions, we also evaluate our model on the pointing questions in the Visual-7W dataset [32].…”

Section: Methods (mentioning)

confidence: 99%

“…Finally, we evaluate our method on the multiple choice pointing questions (i.e. "which" questions) in visual question answering on the Visual-7W dataset [32]. Given an image and a question like "which tomato slice is under the knife", the task is to select the corresponding region from a few choice regions (4 choices in this dataset) as answer.…”

Section: Answering Pointing Questions in Visual-7W (mentioning)

confidence: 99%
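The multiple-choice pointing task quoted above (pick one region out of 4 candidates per question) can be scored with a simple accuracy measure. The following is a minimal sketch, not the cited authors' evaluation code: the function names `pick_region` and `pointing_accuracy`, the per-region score lists, and the example numbers are all hypothetical, and it assumes some model has already produced one score per candidate region.

```python
from typing import Sequence


def pick_region(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate region.

    Hypothetical helper: `scores` holds one model score per candidate
    region (4 candidates per question in the setting quoted above).
    """
    if not scores:
        raise ValueError("need at least one candidate region")
    return max(range(len(scores)), key=lambda i: scores[i])


def pointing_accuracy(
    all_scores: Sequence[Sequence[float]],
    answers: Sequence[int],
) -> float:
    """Fraction of questions whose top-scoring region is the correct one."""
    if len(all_scores) != len(answers):
        raise ValueError("one answer index is required per question")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(
        pick_region(scores) == answer
        for scores, answer in zip(all_scores, answers)
    )
    return correct / len(answers)


# Illustrative (made-up) scores for three questions with 4 candidates each.
example_scores = [
    [0.1, 0.7, 0.1, 0.1],  # model picks region 1
    [0.4, 0.3, 0.2, 0.1],  # model picks region 0
    [0.2, 0.2, 0.5, 0.1],  # model picks region 2
]
example_answers = [1, 0, 3]  # the third question is answered incorrectly
example_accuracy = pointing_accuracy(example_scores, example_answers)
```

With these made-up scores, two of the three questions are answered correctly, so `example_accuracy` is 2/3. Random guessing among 4 candidates would be expected to reach 25%, which is the natural floor when reading accuracies on this task.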


Modeling Relationships in Referential Expressions with Compositional Modular Networks

Rohrbach; Andreas; et al. 2017. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)



People often refer to entities in an image in terms of their relationships with other entities. For example, the black cat sitting under the table refers to both a black cat entity and its relationship with another table entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of categories. In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference end-to-end. Our approach is built around two types of neural modules that inspect local regions and pairwise interactions between regions. We evaluate CMNs on multiple referential expression datasets, outperforming state-of-the-art approaches on all tasks.


“…Booted by the development of Deep Learning, letting the computer understand an image seems to be increasingly closer. With the research on object detection gradually becoming mature [37,36,32,24,22,23], increasingly more researchers put their attention on higher-level understanding of the scene [21,51,2,48,49,46,9,6,7,47]. As an intermediate level task connecting the image caption and object detection, visual relationship/phrase detection is gaining more attention in scene understanding [33,41,3].…”

Section: Introduction (mentioning)

confidence: 99%

ViP-CNN: Visual Phrase Guided Convolutional Neural Network

Ouyang; Wang; et al. 2017. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)



As the intermediate level task connecting image captioning and object detection, visual relationship detection started to catch researchers' attention because of its descriptive power and clear structure. It detects the objects and captures their pair-wise interactions with a subject-predicate-object triplet, e.g. person-ride-horse. In this paper, each visual relationship is considered as a phrase with three components. We formulate the visual relationship detection as three inter-connected recognition problems and propose a Visual Phrase guided Convolutional Neural Network (ViP-CNN) to address them simultaneously. In ViP-CNN, we present a Phrase-guided Message Passing Structure (PMPS) to establish the connection among relationship components and help the model consider the three problems jointly. Corresponding non-maximum suppression method and model training strategy are also proposed. Experimental results show that our ViP-CNN outperforms the state-of-art method both in speed and accuracy. We further pretrain ViP-CNN on our cleansed Visual Genome Relationship dataset, which is found to perform better than the pretraining on the ImageNet for this task.

