2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2015.7298754

From captions to visual concepts and back

Abstract: This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set o…
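
The pipeline the abstract describes (region-level word detectors trained with multiple instance learning, feeding a language model) can be sketched compactly. The noisy-OR pooling below is one standard MIL formulation consistent with the abstract; the feature dimension, vocabulary size, and all names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class NoisyOrWordDetector(nn.Module):
    """Hedged sketch of a multiple-instance-learning word detector.

    An image is treated as a bag of region feature vectors. Each region
    gets a per-word probability from a linear layer; the image-level
    probability that a word is present is a noisy-OR over regions:
        P(word | image) = 1 - prod_r (1 - P(word | region_r)).
    Dimensions and layer shapes are illustrative assumptions.
    """

    def __init__(self, feat_dim=4096, vocab_size=1000):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, vocab_size)

    def forward(self, region_feats):
        # region_feats: (num_regions, feat_dim) for one image
        p_region = torch.sigmoid(self.scorer(region_feats))  # (R, V)
        # Noisy-OR pooling across regions -> image-level word probabilities
        p_image = 1.0 - torch.prod(1.0 - p_region, dim=0)    # (V,)
        return p_image

# Training target: a multi-label vector marking which vocabulary words
# occur in the image's captions; binary cross-entropy fits naturally.
detector = NoisyOrWordDetector()
feats = torch.randn(12, 4096)                # 12 candidate regions (dummy data)
labels = torch.zeros(1000); labels[3] = 1.0  # e.g. word 3 appears in a caption
loss = nn.functional.binary_cross_entropy(detector(feats), labels)
```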

Cited by 1,050 publications (630 citation statements). References 109 publications (191 reference statements).

“…Other methods map both sentences and their images to a common vector space (Ordonez et al 2011) or map them to a space of triples (Farhadi et al 2010). Among those in the second category, a common theme has been to use recurrent neural networks to produce novel captions (Kiros et al 2014; Mao et al 2014; Karpathy and Fei-Fei 2015; Vinyals et al 2015; Chen and Lawrence Zitnick 2015; Donahue et al 2015; Fang et al 2015). More recently, researchers have also used a visual attention model.…”
Section: Image Descriptions
Citation type: mentioning; confidence: 99%
“…Recently image description has gained increased attention with work such as that of Donahue et al (2015), Fang et al (2015), Karpathy and Fei-Fei (2015), Kiros et al (2014, 2015), Mao et al (2015), Vinyals et al (2015) and Xu et al (2015a). Much of the recent work has relied on Recurrent Neural Networks (RNNs) and in particular on long short-term memory networks (LSTMs).…”
Section: Image Description
Citation type: mentioning; confidence: 99%
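
The RNN/LSTM theme these statements reference can be made concrete with a minimal decoder: an image embedding initializes the recurrent state, and words are predicted one step at a time. This is a generic sketch in the spirit of the cited captioning work, not any one paper's model; all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class LSTMCaptioner(nn.Module):
    """Minimal sketch of an RNN/LSTM caption generator. Sizes are
    illustrative assumptions, not taken from any cited paper."""

    def __init__(self, vocab_size=1000, img_dim=2048, hid_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hid_dim)  # image -> initial state
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feat, caption_tokens):
        # img_feat: (B, img_dim); caption_tokens: (B, T) ground-truth prefix
        h0 = torch.tanh(self.img_proj(img_feat)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        emb = self.embed(caption_tokens)                       # (B, T, H)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)  # (B, T, V) next-word logits

# Training pairs each position's logits with the following ground-truth
# token under a cross-entropy loss (standard teacher forcing).
```
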
“…Fang et al [16] proposed a caption generation system utilizing a bag of words method. Their work implements multiple instance learning and uses visual classifiers for words that commonly appear in existing captions.…”
Section: Recent Research On Image Captioning
Citation type: mentioning; confidence: 99%
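
As a rough illustration of the bag-of-words conditioning this statement describes: Fang et al's maximum-entropy language model scores each next word given the preceding words and the set of detected words not yet used. The toy greedy generator below mimics only that bookkeeping (a score boost for unused detected words, each removed from the bag once emitted); the scoring callback and vocabulary are invented stand-ins, and the real model is a trained maximum-entropy LM, not this heuristic.

```python
import random

def generate_from_word_bag(detected, lm_score, max_len=12, boost=2.0):
    """Toy greedy generator conditioned on a bag of detected words.
    `lm_score(history, word)` is an assumed scoring callback; unused
    detected words get a score boost until they are emitted."""
    remaining = set(detected)
    history = ["<s>"]
    vocab = list({"a", "dog", "on", "grass", "sitting", "the", "</s>"}
                 | remaining)  # tiny illustrative vocabulary
    for _ in range(max_len):
        def score(w):
            return lm_score(history, w) + (boost if w in remaining else 0.0)
        word = max(vocab, key=score)  # greedy pick; real systems beam-search
        if word == "</s>":
            break
        history.append(word)
        remaining.discard(word)  # each detected word is used at most once
    return " ".join(history[1:])

# Dummy stand-in language model: fixed random bigram scores, for
# illustration only; a real system would use a trained model here.
rng = random.Random(0)
table = {}
def lm_score(history, word):
    key = (history[-1], word)
    if key not in table:
        table[key] = rng.uniform(-1.0, 1.0)
    return table[key]

print(generate_from_word_bag({"dog", "grass", "sitting"}, lm_score))
```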