Rich Image Captioning in the Wild

Tran, Kenneth; He, Xiaodong; Zhang, Lei; Sun, Jun; Carapcea, Cornelia; Thrasher, Chris; Buehler, Chris; Sienkiewicz, Chris

doi:10.48550/arxiv.1603.09016

Cited by 3 publications

(4 citation statements)

References 0 publications


“…CaptionBot (Tran et al., 2016): CaptionBot is a publicly available image captioning system built mainly on vision models. It uses deep residual networks (ResNets) to detect visual concepts, a MELM language model for sentence generation, and DMSM for caption ranking. An entity recognition model for celebrities and landmarks is further incorporated to enrich captions, and a confidence scoring model is finally used to select the output caption.…”

Section: Compared Approaches (mentioning)

confidence: 99%

Boosting Image Captioning with Attributes

Yao, Pan, et al. 2017

2017 IEEE International Conference on Computer Vision (ICCV)


Automatically describing an image with natural language has been an emerging challenge in both computer vision and natural language processing. In this paper, we present Long Short-Term Memory with Attributes (LSTM-A), a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework by training them in an end-to-end manner. To incorporate attributes, we construct variants of the architecture by feeding image representations and attributes into RNNs in different ways (e.g., leveraging only attributes, inserting image representations first and then attributes or vice versa, and inputting image representations/attributes once or at each time step) to explore the mutual but also fuzzy relationship between them. Extensive experiments are conducted on the COCO image captioning dataset, and our framework achieves superior results when compared to state-of-the-art deep models. Most remarkably, we obtain METEOR/CIDEr-D of 25.2%/98.6% on the testing data of widely used and publicly available splits when extracting image representations by GoogleNet, and achieve the top-1 performance to date on the COCO captioning leaderboard.


“…[7] firstly use Multiple Instance Learning to train attributes detector and then generate sentence through a maximum-entropy language model based on the outputs of attributes detector. Later in [27], this framework is further developed with a larger range of attributes, additionally including celebrities and landmarks, to enrich the generated sentence. More recently, in [31], high-level concepts/attributes are shown to obtain clear improvements on image captioning task when injected into existing state-of-the-art RNN-based model and such visual attributes are also utilized as semantic attention in [34] to enhance image captioning.…”

Section: Related Work (mentioning)

confidence: 99%

Video Captioning with Transferred Semantic Attributes

Pan, Yao, et al. 2017

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)


Automatically generating natural language descriptions of videos is a fundamental challenge for the computer vision community. Most recent progress on this problem has been achieved by employing 2-D and/or 3-D Convolutional Neural Networks (CNN) to encode video content and Recurrent Neural Networks (RNN) to decode a sentence. In this paper, we present Long Short-Term Memory with Transferred Semantic Attributes (LSTM-TSA), a novel deep architecture that incorporates the transferred semantic attributes learnt from images and videos into the CNN plus RNN framework by training them in an end-to-end manner. The design of LSTM-TSA is inspired by the facts that 1) semantic attributes make a significant contribution to captioning, and 2) images and videos carry complementary semantics and thus can reinforce each other for captioning. To boost video captioning, we propose a novel transfer unit to model the mutually correlated attributes learnt from images and videos. Extensive experiments are conducted on three public datasets, i.e., MSVD, M-VAD and MPII-MD. Our proposed LSTM-TSA achieves the best published performance to date in sentence generation on MSVD: 52.8% and 74.0% in terms of BLEU@4 and CIDEr-D. Superior results when compared to state-of-the-art methods are also reported on M-VAD and MPII-MD.


“…Unbiased evaluations are notoriously difficult, and there is a growing trend toward evaluation on out-of-distribution data, i.e. where the test set is drawn from a different distribution than the training set [1,4,36,40]. In this spirit, our benchmark includes multiple training/test splits drawn from different distributions to evaluate generalization under controlled conditions.…”

Section: Evaluation Of High-level Tasks In Computer Vision (mentioning)

confidence: 99%

“…Generalisation is a key issue that limits the robustness, and thus practicality, of deep learning (see [19, 17, 13, 39] among many others). Current benchmarks that require visual reasoning, with few exceptions [1,4,40], use training and test splits that follow an identical distribution, which encourages methods to exploit dataset-specific biases (e.g. class imbalance) and superficial correlations [23,33].…”

Section: Introduction (mentioning)

confidence: 99%

V-PROM: A Benchmark for Visual Reasoning Using Visual Progressive Matrices

Teney, Wang, Cao, et al. 2019

Preprint


One of the primary challenges faced by deep learning is the degree to which current methods exploit superficial statistics and dataset bias, rather than learning to generalise over the specific representations they have experienced. This is a critical concern because generalisation enables robust reasoning over unseen data, whereas leveraging superficial statistics is fragile to even small changes in data distribution. To illuminate the issue and drive progress towards a solution, we propose a test that explicitly evaluates abstract reasoning over visual data. We introduce a large-scale benchmark of visual questions that involve operations fundamental to many high-level vision tasks, such as comparisons of counts and logical operations on complex visual properties. The benchmark directly measures a method's ability to infer high-level relationships and to generalise them over image-based concepts. It includes multiple training/test splits that require controlled levels of generalization. We evaluate a range of deep learning architectures, and find that existing models, including those popular for vision-and-language tasks, are unable to solve seemingly-simple instances. Models using relational networks fare better but leave substantial room for improvement.
