2010
DOI: 10.1007/978-3-642-15561-1_2

Every Picture Tells a Story: Generating Sentences from Images

Abstract: Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discrimina…
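
In the paper, the meaning estimate on both sides is an <object, action, scene> triple, and the link score rewards agreement between the triple inferred from the image and the one inferred from the sentence. The following Python sketch is a hypothetical illustration of that scoring idea only; the tiny vocabularies and the toy per-side scorers stand in for the learned discriminative models and are not the authors' implementation.

from itertools import product

# Illustrative vocabularies for the <object, action, scene> meaning space.
OBJECTS = ["dog", "horse", "person"]
ACTIONS = ["run", "ride", "sit"]
SCENES = ["field", "street", "room"]

def link_score(image_scorer, sentence_scorer):
    """Score an image-sentence pair as the best combined confidence over all
    candidate <object, action, scene> triples, where each side supplies its
    own confidence for a triple (a stand-in for the learned estimates)."""
    return max(image_scorer(t) + sentence_scorer(t)
               for t in product(OBJECTS, ACTIONS, SCENES))

# Toy scorers: in the real system these confidences come from trained models.
def image_scorer(triple):
    obj, act, scene = triple
    return (obj == "horse") + (act == "ride") + (scene == "field")

def sentence_scorer(triple):
    # Pretend the sentence is "A person rides a horse in a field."
    obj, act, scene = triple
    return (obj == "horse") + (act == "ride") + (scene == "field")

print(link_score(image_scorer, sentence_scorer))  # 6: both sides agree on <horse, ride, field>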

Cited by 911 publications (650 citation statements)
References 19 publications (18 reference statements)
“…Methods in the first category use similarity metrics between image features from predefined models to retrieve similar sentences (Ordonez et al 2011;Hodosh et al 2013). Other methods map both sentences and their images to a common vector space (Ordonez et al 2011) or map them to a space of triples (Farhadi et al 2010). Among those in the second category, a common theme has been to use recurrent neural networks to produce novel captions (Kiros et al 2014;Mao et al 2014;Karpathy and Fei-Fei 2015;Vinyals et al 2015;Chen and Lawrence Zitnick 2015;Donahue et al 2015;Fang et al 2015).…”
Section: Image Descriptions (mentioning)
confidence: 99%
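
The first category described above maps images and sentences into a shared embedding space and retrieves the closest caption for a query image. As a minimal sketch, assuming precomputed embeddings and cosine similarity as the retrieval metric (both are assumptions, not details taken from the cited works):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve_caption(image_vec, caption_vecs, captions):
    """Return the caption whose embedding lies closest to the image embedding."""
    sims = [cosine(image_vec, c) for c in caption_vecs]
    return captions[int(np.argmax(sims))]

# Toy example: random 8-d vectors stand in for learned joint embeddings.
rng = np.random.default_rng(0)
captions = ["a dog runs in a field", "a person rides a horse", "a cat sits on a sofa"]
caption_vecs = rng.normal(size=(3, 8))
image_vec = caption_vecs[1] + 0.05 * rng.normal(size=8)  # image embedded near caption 1
print(retrieve_caption(image_vec, caption_vecs, captions))  # "a person rides a horse"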
“…Here we take an analogous approach, modifying the image retrieval stage of a data-driven pipeline, for the task of image captioning. There has been significant recent interest in generating natural language descriptions of photographs (Kulkarni et al 2013;Farhadi et al 2010b). These techniques are typically quite complex: they recognize various visual concepts such as objects, materials, scene types, and the spatial relationships among these entities, and then generate plausible natural language sentences based on this scene understanding.…”
Section: Scene Attributes As Global Features (mentioning)
confidence: 99%
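
The pipelines mentioned in this statement first detect visual concepts and then realize them as text; the realization step is often as simple as filling a sentence template. The sketch below illustrates only that final template step with made-up detections; it is a hypothetical example, not the generation model of any cited system.

def realize(detections):
    """Render detected concepts (object, attribute, action, scene) as a sentence
    via a fixed template."""
    return "A {attribute} {object} {action} in the {scene}.".format(**detections)

print(realize({"object": "horse", "attribute": "brown",
               "action": "grazes", "scene": "field"}))
# -> "A brown horse grazes in the field."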
“…The dataset of Farhadi et al. [21] contains 1000 images selected from the 2008 PASCAL development kit, covering 20 categories. Each image is described by 5 sentences.…”
Section: Pascal Sentences Dataset (mentioning)
confidence: 99%
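
For the dataset layout this statement describes (1000 PASCAL images, 5 human-written sentences each), a loader might look like the following minimal sketch; the JSON dump and its filename are assumptions made for illustration, not the dataset's actual distribution format.

import json
from pathlib import Path

def load_pascal_sentences(root):
    """Load a hypothetical dump of the PASCAL Sentence data: a single JSON file
    mapping image filename -> list of exactly 5 captions."""
    data = json.loads(Path(root, "pascal_sentences.json").read_text())
    assert all(len(captions) == 5 for captions in data.values())
    return data  # e.g. {"2008_000032.jpg": ["A horse grazes ...", ...], ...}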