2010
DOI: 10.1007/978-3-642-15561-1_2

Every Picture Tells a Story: Generating Sentences from Images

Abstract: Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discrimina…
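
In the paper, the meaning estimate on both sides is an <object, action, scene> triple, and the link score rewards agreement between the triple inferred from the image and the one inferred from the sentence. The following Python sketch is a hypothetical illustration of that scoring idea only; the tiny vocabularies and the toy per-side scorers stand in for the learned discriminative models and are not the authors' implementation.

from itertools import product

# Illustrative vocabularies for the <object, action, scene> meaning space.
OBJECTS = ["dog", "horse", "person"]
ACTIONS = ["run", "ride", "sit"]
SCENES = ["field", "street", "room"]

def link_score(image_scorer, sentence_scorer):
    """Score an image-sentence pair as the best combined confidence over all
    candidate <object, action, scene> triples, where each side supplies its
    own confidence for a triple (a stand-in for the learned estimates)."""
    return max(image_scorer(t) + sentence_scorer(t)
               for t in product(OBJECTS, ACTIONS, SCENES))

# Toy scorers: in the real system these confidences come from trained models.
def image_scorer(triple):
    obj, act, scene = triple
    return (obj == "horse") + (act == "ride") + (scene == "field")

def sentence_scorer(triple):
    # Pretend the sentence is "A person rides a horse in a field."
    obj, act, scene = triple
    return (obj == "horse") + (act == "ride") + (scene == "field")

print(link_score(image_scorer, sentence_scorer))  # 6: both sides agree on <horse, ride, field>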

Cited by 911 publications (650 citation statements)
References 19 publications (18 reference statements)
“…Methods in the first category use similarity metrics between image features from predefined models to retrieve similar sentences (Ordonez et al 2011;Hodosh et al 2013). Other methods map both sentences and their images to a common vector space (Ordonez et al 2011) or map them to a space of triples (Farhadi et al 2010). Among those in the second category, a common theme has been to use recurrent neural networks to produce novel captions (Kiros et al 2014;Mao et al 2014;Karpathy and Fei-Fei 2015;Vinyals et al 2015;Chen and Lawrence Zitnick 2015;Donahue et al 2015;Fang et al 2015).…”
Section: Image Descriptions (mentioning)
confidence: 99%
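
The first category described above maps images and sentences into a shared embedding space and retrieves the closest caption for a query image. As a minimal sketch, assuming precomputed embeddings and cosine similarity as the retrieval metric (both are assumptions, not details taken from the cited works):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve_caption(image_vec, caption_vecs, captions):
    """Return the caption whose embedding lies closest to the image embedding."""
    sims = [cosine(image_vec, c) for c in caption_vecs]
    return captions[int(np.argmax(sims))]

# Toy example: random 8-d vectors stand in for learned joint embeddings.
rng = np.random.default_rng(0)
captions = ["a dog runs in a field", "a person rides a horse", "a cat sits on a sofa"]
caption_vecs = rng.normal(size=(3, 8))
image_vec = caption_vecs[1] + 0.05 * rng.normal(size=8)  # image embedded near caption 1
print(retrieve_caption(image_vec, caption_vecs, captions))  # "a person rides a horse"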
“…Here we take an analogous approach, modifying the image retrieval stage of a data-driven pipeline, for the task of image captioning. There has been significant recent interest in generating natural language descriptions of photographs (Kulkarni et al 2013;Farhadi et al 2010b). These techniques are typically quite complex: they recognize various visual concepts such as objects, materials, scene types, and the spatial relationships among these entities, and then generate plausible natural language sentences based on this scene understanding.…”
Section: Scene Attributes As Global Features (mentioning)
confidence: 99%
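
The pipelines mentioned in this statement first detect visual concepts and then realize them as text; the realization step is often as simple as filling a sentence template. The sketch below illustrates only that final template step with made-up detections; it is a hypothetical example, not the generation model of any cited system.

def realize(detections):
    """Render detected concepts (object, attribute, action, scene) as a sentence
    via a fixed template."""
    return "A {attribute} {object} {action} in the {scene}.".format(**detections)

print(realize({"object": "horse", "attribute": "brown",
               "action": "grazes", "scene": "field"}))
# -> "A brown horse grazes in the field."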
“…The dataset of Farhadi et al. [21] contains 1000 images selected from the 2008 PASCAL development kit, covering 20 categories. Each image is described by 5 sentences.…”
Section: Pascal Sentences Dataset (mentioning)
confidence: 99%
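
For the dataset layout this statement describes (1000 PASCAL images, 5 human-written sentences each), a loader might look like the following minimal sketch; the JSON dump and its filename are assumptions made for illustration, not the dataset's actual distribution format.

import json
from pathlib import Path

def load_pascal_sentences(root):
    """Load a hypothetical dump of the PASCAL Sentence data: a single JSON file
    mapping image filename -> list of exactly 5 captions."""
    data = json.loads(Path(root, "pascal_sentences.json").read_text())
    assert all(len(captions) == 5 for captions in data.values())
    return data  # e.g. {"2008_000032.jpg": ["A horse grazes ...", ...], ...}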