2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2015.7298754

From captions to visual concepts and back

Abstract: This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set o…
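
The pipeline the abstract describes (region-level word detectors trained with multiple instance learning, feeding a language model) can be sketched compactly. The noisy-OR pooling below is one standard MIL formulation consistent with the abstract; the feature dimension, vocabulary size, and all names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class NoisyOrWordDetector(nn.Module):
    """Hedged sketch of a multiple-instance-learning word detector.

    An image is treated as a bag of region feature vectors. Each region
    gets a per-word probability from a linear layer; the image-level
    probability that a word is present is a noisy-OR over regions:
        P(word | image) = 1 - prod_r (1 - P(word | region_r)).
    Dimensions and layer shapes are illustrative assumptions.
    """

    def __init__(self, feat_dim=4096, vocab_size=1000):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, vocab_size)

    def forward(self, region_feats):
        # region_feats: (num_regions, feat_dim) for one image
        p_region = torch.sigmoid(self.scorer(region_feats))  # (R, V)
        # Noisy-OR pooling across regions -> image-level word probabilities
        p_image = 1.0 - torch.prod(1.0 - p_region, dim=0)    # (V,)
        return p_image

# Training target: a multi-label vector marking which vocabulary words
# occur in the image's captions; binary cross-entropy fits naturally.
detector = NoisyOrWordDetector()
feats = torch.randn(12, 4096)                # 12 candidate regions (dummy data)
labels = torch.zeros(1000); labels[3] = 1.0  # e.g. word 3 appears in a caption
loss = nn.functional.binary_cross_entropy(detector(feats), labels)
```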

Cited by 1,050 publications (630 citation statements). References 109 publications (191 reference statements).

“…Other methods map both sentences and their images to a common vector space (Ordonez et al 2011) or map them to a space of triples (Farhadi et al 2010). Among those in the second category, a common theme has been to use recurrent neural networks to produce novel captions (Kiros et al 2014; Mao et al 2014; Karpathy and Fei-Fei 2015; Vinyals et al 2015; Chen and Lawrence Zitnick 2015; Donahue et al 2015; Fang et al 2015). More recently, researchers have also used a visual attention model.…”
Section: Image Descriptions
Citation type: mentioning; confidence: 99%
“…Recently image description has gained increased attention with work such as that of Donahue et al (2015), Fang et al (2015), Karpathy and Fei-Fei (2015), Kiros et al (2014, 2015), Mao et al (2015), Vinyals et al (2015) and Xu et al (2015a). Much of the recent work has relied on Recurrent Neural Networks (RNNs) and in particular on long short-term memory networks (LSTMs).…”
Section: Image Description
Citation type: mentioning; confidence: 99%
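
The RNN/LSTM theme these statements reference can be made concrete with a minimal decoder: an image embedding initializes the recurrent state, and words are predicted one step at a time. This is a generic sketch in the spirit of the cited captioning work, not any one paper's model; all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class LSTMCaptioner(nn.Module):
    """Minimal sketch of an RNN/LSTM caption generator. Sizes are
    illustrative assumptions, not taken from any cited paper."""

    def __init__(self, vocab_size=1000, img_dim=2048, hid_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hid_dim)  # image -> initial state
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feat, caption_tokens):
        # img_feat: (B, img_dim); caption_tokens: (B, T) ground-truth prefix
        h0 = torch.tanh(self.img_proj(img_feat)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        emb = self.embed(caption_tokens)                       # (B, T, H)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)  # (B, T, V) next-word logits

# Training pairs each position's logits with the following ground-truth
# token under a cross-entropy loss (standard teacher forcing).
```
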
“…Fang et al [16] proposed a caption generation system utilizing a bag of words method. Their work implements multiple instance learning and uses visual classifiers for words that commonly appear in existing captions.…”
Section: Recent Research On Image Captioning
Citation type: mentioning; confidence: 99%
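
As a rough illustration of the bag-of-words conditioning this statement describes: Fang et al's maximum-entropy language model scores each next word given the preceding words and the set of detected words not yet used. The toy greedy generator below mimics only that bookkeeping (a score boost for unused detected words, each removed from the bag once emitted); the scoring callback and vocabulary are invented stand-ins, and the real model is a trained maximum-entropy LM, not this heuristic.

```python
import random

def generate_from_word_bag(detected, lm_score, max_len=12, boost=2.0):
    """Toy greedy generator conditioned on a bag of detected words.
    `lm_score(history, word)` is an assumed scoring callback; unused
    detected words get a score boost until they are emitted."""
    remaining = set(detected)
    history = ["<s>"]
    vocab = list({"a", "dog", "on", "grass", "sitting", "the", "</s>"}
                 | remaining)  # tiny illustrative vocabulary
    for _ in range(max_len):
        def score(w):
            return lm_score(history, w) + (boost if w in remaining else 0.0)
        word = max(vocab, key=score)  # greedy pick; real systems beam-search
        if word == "</s>":
            break
        history.append(word)
        remaining.discard(word)  # each detected word is used at most once
    return " ".join(history[1:])

# Dummy stand-in language model: fixed random bigram scores, for
# illustration only; a real system would use a trained model here.
rng = random.Random(0)
table = {}
def lm_score(history, word):
    key = (history[-1], word)
    if key not in table:
        table[key] = rng.uniform(-1.0, 1.0)
    return table[key]

print(generate_from_word_bag({"dog", "grass", "sitting"}, lm_score))
```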