2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.524
Boosting Image Captioning with Attributes

Abstract: Automatically describing an image in natural language has been an emerging challenge in both the computer vision and natural language processing fields. In this paper, we present Long Short-Term Memory with Attributes (LSTM-A), a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner. To incorporate attributes, we construct variants of architectures by fe…
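The architectural idea in the abstract, injecting a detected-attribute vector into a CNN-plus-LSTM captioner, can be illustrated with a minimal sketch. This is not the authors' code: the module names, dimensions, and the particular choice of feeding attributes before the image feature are assumptions (the abstract states that the paper constructs several feeding variants).

```python
import torch
import torch.nn as nn

class AttributeCaptioner(nn.Module):
    """Minimal sketch of an LSTM-A-style captioner: a CNN image
    embedding plus a vector of attribute probabilities condition an
    LSTM language model. All dimensions here are illustrative."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512,
                 img_dim=2048, num_attrs=1000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)     # CNN feature -> embed space
        self.attr_proj = nn.Linear(num_attrs, embed_dim)  # attribute vector -> embed space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, attr_probs, captions):
        # One of several possible feeding orders (an assumption here):
        # attributes first, then the image, then the word embeddings.
        a = self.attr_proj(attr_probs).unsqueeze(1)   # (B, 1, E)
        v = self.img_proj(img_feat).unsqueeze(1)      # (B, 1, E)
        w = self.embed(captions)                      # (B, T, E)
        seq = torch.cat([a, v, w], dim=1)
        h, _ = self.lstm(seq)
        return self.out(h[:, 2:, :])  # logits for the word positions only
```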

Citations: cited by 608 publications (370 citation statements)
References: 23 publications
“…So in the experiments, we focus on comparison to a strong CNN-LSTM baseline. We acknowledge that more recent papers (Xu et al., 2017; Rennie et al., 2017; Yao et al., 2017; Lu et al., 2017; Gan et al., 2017) reported better performance on the task of image captioning. Performance improvements in these more recent models are mainly due to using better image features, such as those obtained by Region-based Convolutional Neural Networks (R-CNN); using reinforcement learning (RL) to directly optimize metrics such as CIDEr; using more complex attention mechanisms (Gan et al., 2017) to provide a better context vector for caption generation; or using an ensemble of multiple LSTMs, among others.…”
Section: Discussion (mentioning)
confidence: 73%
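The "context vector" that the attention mechanisms mentioned above compute at each decoding step can be sketched as additive soft attention over image regions. This is a generic illustration, not code from any of the cited papers; names and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Minimal additive (soft) attention: score each spatial image
    region against the decoder state, then return the weighted sum of
    regions as the context vector for predicting the next word."""

    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, attn_dim)
        self.hid_fc = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (B, R, feat_dim) region features, e.g. from an R-CNN
        # hidden:  (B, hidden_dim) current decoder state
        e = self.score(torch.tanh(
            self.feat_fc(regions) + self.hid_fc(hidden).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)          # (B, R, 1) attention weights
        context = (alpha * regions).sum(1)   # (B, feat_dim) context vector
        return context, alpha.squeeze(-1)
```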
“…Despite keen competition, we are still among the top ten methods in terms of overall performance. It is noted that the methods which outperform ours utilize either the more complicated REINFORCE algorithm to maximize the likelihood [40, 53], or time-consuming attribute learning [64] and adaptive attention [42]. In principle, our idea of using weighted training and reference can also be applied to frameworks such as reinforcement learning and attribute learning, which will be one of our future works.…”
Section: Comparison with the State of the Art (mentioning)
confidence: 96%
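The REINFORCE-style optimization this snippet refers to is often implemented as self-critical sequence training (Rennie et al., 2017), where the reward is the metric score of a sampled caption minus that of a greedily decoded baseline. A rough sketch follows; `sample_caption`, `greedy_caption`, and `cider` are hypothetical helpers, not part of any library used by the cited work.

```python
import torch

def self_critical_loss(model, images, refs,
                       sample_caption, greedy_caption, cider):
    """Sketch of a self-critical REINFORCE update for captioning.
    All three helper callables are hypothetical placeholders."""
    # Sample a caption and keep its per-word log-probabilities.
    sampled, log_probs = sample_caption(model, images)   # (B, T), (B, T)
    with torch.no_grad():
        baseline = greedy_caption(model, images)         # greedy decode

    # Advantage: CIDEr of the sample minus CIDEr of the greedy baseline.
    reward = cider(sampled, refs) - cider(baseline, refs)  # (B,)

    # REINFORCE: raise log-probs of captions that beat the baseline.
    return -(reward.unsqueeze(1) * log_probs).sum(1).mean()
```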
“…In the next section, we will first describe the conventional likelihood function for image captioning used in previous works [42, 64] (see Eq. 4), and then we will introduce the proposed likelihood objective function (see Eq.…”
Section: System Overview (mentioning)
confidence: 99%
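For reference, the conventional likelihood the snippet mentions factorizes the caption probability word by word. The snippet's own Eq. 4 is not reproduced here; the following is the standard maximum-likelihood formulation, in our notation.

```latex
% Conventional caption likelihood: the probability of a sentence
% S = (w_1, \dots, w_T) given image I factorizes over words.
\log p(S \mid I) = \sum_{t=1}^{T} \log p\left(w_t \mid w_1, \dots, w_{t-1}, I\right)
```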
“…In recent years, a variety of successive models [2–16, 18–20] have achieved promising results. To generate captions, semantic concepts or attributes of objects in images are detected and utilized as inputs to the RNN decoder [3, 6, 12, 20, 22].…”
Section: Deep Image Captioning (mentioning)
confidence: 99%
“…The encoder-decoder model first extracts high-level visual features from a CNN trained on image classification, and then feeds these visual features into an RNN to predict the subsequent words of a caption for a given image. In recent years, a variety of successive models [2–16, 18–20] have achieved promising results. Semantic concept analysis, or attribute prediction [17, 21], is a task closely related to image captioning, because attributes can be interpreted as a basis for descriptions.…”
Section: Deep Image Captioning (mentioning)
confidence: 99%
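The encoder-decoder pipeline this snippet describes, CNN features conditioning an RNN that predicts subsequent words, reduces at inference time to a decoding loop like the sketch below. It reuses the hypothetical AttributeCaptioner interface from the earlier sketch; the token ids and length cap are assumptions.

```python
import torch

@torch.no_grad()
def greedy_decode(captioner, img_feat, attr_probs,
                  bos_id=1, eos_id=2, max_len=20):
    """Greedy decoding sketch for an encoder-decoder captioner:
    condition on the image (and attributes), then repeatedly append
    the most probable next word until EOS or the length cap."""
    words = torch.full((img_feat.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = captioner(img_feat, attr_probs, words)  # (B, T, V)
        next_word = logits[:, -1, :].argmax(-1, keepdim=True)
        words = torch.cat([words, next_word], dim=1)
        if (next_word == eos_id).all():
            break
    return words[:, 1:]  # drop the BOS token
```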