2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2018.00754
Neural Baby Talk

Abstract: We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image. Our approach reconciles classical slot filling approaches (that are generally better grounded in images) with modern neural captioning approaches (that are generally more natural sounding and accurate). Our approach first generates a sentence 'template' with slot locations explicitly tied to specific image regions. These slots are then filled in by visual concepts identified in the regions by object detectors.
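The two-stage decoding the abstract describes (generate a slotted template, then fill each slot with a detected visual word) can be illustrated with a toy sketch. Everything below, including the slot-token format, region names, and detections, is a hypothetical illustration under assumed data, not the authors' implementation, which is end-to-end differentiable.

```python
# Minimal toy sketch of "template then fill" decoding. The real model
# produces slot tokens with a pointer mechanism over detected regions;
# here both stages are hard-coded for illustration.

# Hypothetical object-detector output: one visual word per region.
detections = {
    "region_0": "puppy",
    "region_1": "cake",
}

# Stage 1 (assumed output format): a sentence "template" whose slot
# tokens reference the image region they are grounded in.
template = ["a", "<region_0>", "with", "a", "<region_1>"]

def fill_template(template, detections):
    """Replace each slot token with the visual word detected in its region."""
    caption = []
    for token in template:
        if token.startswith("<") and token.endswith(">"):
            region = token[1:-1]
            caption.append(detections[region])  # grounded visual word
        else:
            caption.append(token)               # ordinary template word
    return " ".join(caption)

print(fill_template(template, detections))  # -> "a puppy with a cake"
```

Grounding the slotted words in detector output is what lets such a model name objects never seen in the training captions, since the fill step draws on the detector's vocabulary rather than the caption corpus.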

Cited by 434 publications (372 citation statements). References 45 publications.
“…Other work explores generalization to unseen combinations of visual concepts as a classification task (Misra et al., 2017; Kato et al., 2018). Lu et al. (2018) is more closely related to our work; they evaluate captioning models on describing images with unseen noun-noun pairs.…”
Section: Compositional Models of Language (mentioning)
confidence: 99%
“…Similar to [2, 28, 48], our RDN also utilizes the attention mechanism and follows the encoder-decoder framework. However, we explicitly study the coherence between words, which remedies the drawback of current captioning frameworks in modeling long-term dependencies in the decoder.…”
Section: Related Work (mentioning)
confidence: 99%
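Since this excerpt refers to attention within an encoder-decoder captioner, a minimal sketch of one attention-weighted decoder step may help. The names, shapes, and dot-product scoring are assumptions illustrating the generic mechanism, not the cited RDN model.

```python
# Minimal sketch of one decoder step with dot-product attention over
# encoder region features. Shapes and names are hypothetical.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, region_features):
    """Weight region features by relevance to the current decoder state."""
    scores = region_features @ decoder_state   # (num_regions,)
    weights = softmax(scores)                  # attention distribution
    context = weights @ region_features        # weighted sum, (dim,)
    return context, weights

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))  # 5 image regions, 8-dim features each
state = rng.normal(size=8)         # current decoder hidden state
context, weights = attend(state, regions)
print(weights.round(3), context.shape)
```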
“…However, existing models can only describe objects seen in the training image-caption pairs, which hinders their generalization to real-world scenarios. How to describe images containing unseen objects is still a challenge for image captioning [9], [10], [11].…”
Section: Introduction (mentioning)
confidence: 99%