2016
DOI: 10.48550/arxiv.1603.09016
Preprint

Rich Image Captioning in the Wild

Cited by 3 publications (4 citation statements, all of type "mentioning"); references 0 publications.

“…• CaptionBot (Tran et al, 2016): CaptionBot is a publicly available image captioning system built mainly on vision models: deep residual networks (ResNets) detect visual concepts, a maximum entropy language model (MELM) generates candidate sentences, and a deep multimodal similarity model (DMSM) ranks the captions. An entity recognition model for celebrities and landmarks is further incorporated to enrich captions, and a confidence scoring model is finally used to select the output caption.…”
Section: Compared Approaches
confidence: 99%
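The staged pipeline this statement describes (ResNet concept detection, MELM generation, DMSM ranking, entity enrichment, confidence scoring) can be sketched as follows. This is a minimal illustration under assumed component interfaces (detector.detect_concepts, language_model.generate, and the other methods are hypothetical stand-ins), not the actual CaptionBot code.

```python
# Sketch of a CaptionBot-style staged pipeline, as described in the
# citation statement above. All component interfaces are hypothetical.

def caption_image(image, detector, language_model, ranker,
                  entity_recognizer, confidence_scorer,
                  fallback="I can't quite describe this image."):
    # 1. ResNet-based detector proposes visual concepts for the image.
    concepts = detector.detect_concepts(image)   # e.g. ["man", "horse", "field"]

    # 2. Maximum entropy language model (MELM) generates candidate
    #    sentences covering the detected concepts.
    candidates = language_model.generate(concepts)

    # 3. Deep multimodal similarity model (DMSM) ranks candidates by
    #    image-sentence similarity.
    ranked = sorted(candidates,
                    key=lambda s: ranker.similarity(image, s),
                    reverse=True)

    # 4. Entity recognition enriches the best caption with celebrity /
    #    landmark names when available.
    enriched = entity_recognizer.enrich(image, ranked[0])

    # 5. A confidence model decides whether the caption is good enough
    #    to show; otherwise fall back to a safe default.
    if confidence_scorer.score(image, enriched) < confidence_scorer.threshold:
        return fallback
    return enriched
```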
“…[7] first use Multiple Instance Learning to train attribute detectors and then generate sentences through a maximum-entropy language model conditioned on the detectors' outputs. Later, in [27], this framework is further developed with a larger range of attributes, additionally including celebrities and landmarks, to enrich the generated sentences. More recently, in [31], high-level concepts/attributes are shown to yield clear improvements on the image captioning task when injected into existing state-of-the-art RNN-based models, and such visual attributes are also utilized as semantic attention in [34] to enhance image captioning.…”
Section: Related Work
confidence: 99%
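As a concrete illustration of the idea attributed to [31] above, injecting detected high-level attributes into an RNN-based captioner, here is a minimal PyTorch sketch in which the attribute probability vector initializes the LSTM decoder state. It is a schematic reconstruction under assumed layer sizes, not the cited authors' implementation.

```python
import torch
import torch.nn as nn

class AttributeInjectedCaptioner(nn.Module):
    """Toy captioner: a detected-attribute vector conditions the LSTM decoder."""

    def __init__(self, num_attributes=1000, vocab_size=10000,
                 embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Map the attribute probability vector to the initial LSTM state,
        # so high-level concepts steer generation from the first step.
        self.attr_to_h = nn.Linear(num_attributes, hidden_dim)
        self.attr_to_c = nn.Linear(num_attributes, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, attributes, captions):
        # attributes: (batch, num_attributes) detector outputs in [0, 1]
        # captions:   (batch, seq_len) token ids of the target caption
        h0 = torch.tanh(self.attr_to_h(attributes)).unsqueeze(0)
        c0 = torch.tanh(self.attr_to_c(attributes)).unsqueeze(0)
        emb = self.embed(captions)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)  # (batch, seq_len, vocab_size) logits

# Usage sketch with random inputs:
model = AttributeInjectedCaptioner()
attrs = torch.rand(2, 1000)
caps = torch.randint(0, 10000, (2, 12))
logits = model(attrs, caps)  # -> torch.Size([2, 12, 10000])
```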
“…Unbiased evaluations are notoriously difficult, and there is a growing trend toward evaluation on out-of-distribution data, i.e., where the test set is drawn from a different distribution than the training set [1,4,36,40]. In this spirit, our benchmark includes multiple training/test splits drawn from different distributions to evaluate generalization under controlled conditions.…”
Section: Evaluation Of High-level Tasks In Computer Vision
confidence: 99%
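One simple way to realize "training/test splits drawn from different distributions" is to partition the data along a per-example attribute so that values held out of training appear only at test time. The numpy sketch below assumes a hypothetical attribute array; it illustrates the general idea, not the benchmark's actual protocol.

```python
import numpy as np

def ood_split(attributes, held_out_values, rng=None):
    """Split indices so that examples whose attribute takes a held-out
    value go to the test set, forcing evaluation on an unseen distribution.

    attributes: 1-D array of any property we want to shift between train
    and test (e.g. scene category, question type) -- hypothetical here.
    """
    rng = rng or np.random.default_rng(0)
    attributes = np.asarray(attributes)
    test_mask = np.isin(attributes, list(held_out_values))
    train_idx = np.flatnonzero(~test_mask)
    test_idx = np.flatnonzero(test_mask)
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx

# Example: hold out one scene category entirely for testing.
attrs = np.array(["indoor", "outdoor", "indoor", "aerial", "aerial"])
train_idx, test_idx = ood_split(attrs, held_out_values={"aerial"})
# train_idx covers indoor/outdoor; test_idx contains only aerial images.
```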
“…Generalisation is a key issue that limits the robustness, and thus the practicality, of deep learning (see [19, 17, 13, 39], among many others). Current benchmarks that require visual reasoning, with few exceptions [1,4,40], use training and test splits that follow an identical distribution, which encourages methods to exploit dataset-specific biases (e.g. class imbalance) and superficial correlations [23,33].…”
Section: Introduction
confidence: 99%
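To make the class-imbalance point concrete: a trivial baseline that ignores the input entirely can look competitive on an identically distributed test split, and collapses once the split distribution shifts. A minimal numpy sketch with synthetic labels, purely illustrative:

```python
import numpy as np

# A degenerate "model" that exploits class imbalance: always predict the
# most frequent training label. On an i.i.d. test split it scores as high
# as the majority-class frequency; on a rebalanced (shifted) split the
# same trick collapses to chance.
def majority_baseline(train_labels):
    values, counts = np.unique(train_labels, return_counts=True)
    return values[np.argmax(counts)]

train = np.array([1] * 90 + [0] * 10)         # 90% positive training set
iid_test = np.array([1] * 90 + [0] * 10)      # same distribution as training
shifted_test = np.array([1] * 50 + [0] * 50)  # rebalanced test set

pred = majority_baseline(train)
print((iid_test == pred).mean())      # 0.9 -- looks strong
print((shifted_test == pred).mean())  # 0.5 -- the bias was all it learned
```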