2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.503

Image Captioning with Semantic Attention

Abstract: Automatically generating a natural language description of an image has attracted interest recently, both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which come up with words describing various aspects of an image and then combine them. In this paper, we propose a new algorithm that combines both approaches through a model of semantic attention.
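The abstract's combination of a top-down image gist with bottom-up attribute words can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch reading of semantic attention: the decoder's hidden state scores a set of detected attribute-word embeddings and mixes them into a context vector. Class and variable names (`SemanticAttention`, `attr_embeds`) and all dimensions are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Hypothetical sketch: at each decoding step, re-weight detected
    attribute-word embeddings against the current hidden state and
    return their weighted mixture as a semantic context vector."""

    def __init__(self, hidden_dim, embed_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim + embed_dim, 1)

    def forward(self, hidden, attr_embeds):
        # hidden:      (batch, hidden_dim)    current decoder state
        # attr_embeds: (batch, k, embed_dim)  embeddings of k detected attributes
        k = attr_embeds.size(1)
        h = hidden.unsqueeze(1).expand(-1, k, -1)                   # (batch, k, hidden_dim)
        scores = self.score(torch.cat([h, attr_embeds], dim=-1))    # (batch, k, 1)
        weights = torch.softmax(scores, dim=1)                      # attention over the k attributes
        return (weights * attr_embeds).sum(dim=1)                   # (batch, embed_dim) context
```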

Cited by 1,457 publications (920 citation statements); citing statements published 2017–2023. References 26 publications.
“…In recent years, much work has been published on image captioning, including [3,4,9,12,20,22,28,31,33], to name a few. Many proposed captioning models exploit RNN-based decoders to generate a sequence of words from encoded representation of input images.…”
Section: Related Work (mentioning)
confidence: 99%
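As a concrete companion to the excerpt above, here is a minimal sketch of the encode-then-decode pattern it describes: a CNN feature vector conditions an RNN that emits one word per time step. All names and sizes (`CaptionDecoder`, `feat_dim`, the GRU cell) are assumptions for illustration, not any specific cited model.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Sketch of a generic RNN-based caption decoder: an encoded image
    feature initializes the hidden state, then one word is predicted
    per time step (teacher forcing during training)."""

    def __init__(self, feat_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):
        # image_feat: (batch, feat_dim); captions: (batch, seq_len) token ids
        h = torch.tanh(self.init_h(image_feat))
        logits = []
        for t in range(captions.size(1)):
            h = self.rnn(self.embed(captions[:, t]), h)  # one word per time step
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (batch, seq_len, vocab_size)
```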
“…Most captioning models are equipped with RNN-based decoders (e.g. [3,22,25,28,31,33]), which predict a word at every time step, based only on the current input and a single or a few hidden states as an implicit summary of all previous history. Thus, RNNs and their variants often fail to capture long-term dependencies, which could worsen if one also wants to incorporate prior knowledge.…”
Section: Introduction (mentioning)
confidence: 99%
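The excerpt's point, that at step t the model sees only the previous word plus a hidden state summarizing all earlier history, can be illustrated with a short greedy-decoding loop over the hypothetical `CaptionDecoder` sketched earlier. Note that nothing but `h` carries information across steps, which is exactly the bottleneck the citing paper criticizes.

```python
import torch

@torch.no_grad()
def greedy_decode(decoder, image_feat, bos_id, eos_id, max_len=20):
    """Illustrative greedy decoding with the hypothetical CaptionDecoder:
    the entire generation history is compressed into the single state h."""
    h = torch.tanh(decoder.init_h(image_feat))
    word = torch.full((image_feat.size(0),), bos_id, dtype=torch.long)
    caption = []
    for _ in range(max_len):
        h = decoder.rnn(decoder.embed(word), h)   # all history lives in h
        word = decoder.out(h).argmax(dim=-1)      # greedy next-word choice
        caption.append(word)
        if (word == eos_id).all():
            break
    return torch.stack(caption, dim=1)            # (batch, generated_len)
```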
“…CNN has proven to be successful in processing image-like data, while RNN is more appropriate for modeling sequential data. Recently, several works [8,23,44,48,52,54] have attempted to combine them, building various CNN-RNN frameworks. Generally, the combination can be divided into two types: the unified combination and the cascaded combination.…”
Section: Usage of CNN-RNN Framework (mentioning)
confidence: 99%
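To make the two combination types concrete, here is a hedged sketch of one common reading: in a cascaded design the CNN runs once and hands a fixed feature to a separate RNN stage, while in a unified design the image feature enters the recurrence at every step so the pair trains as one unit. Module names and the exact wiring are assumptions, not the cited papers' architectures.

```python
import torch
import torch.nn as nn

class CascadedCNNRNN(nn.Module):
    """Cascaded combination (assumed reading): vision stage first,
    then a separate language stage consumes its fixed output."""

    def __init__(self, cnn, decoder):
        super().__init__()
        self.cnn, self.decoder = cnn, decoder

    def forward(self, image, captions):
        feat = self.cnn(image)                  # stage 1: run CNN once
        return self.decoder(feat, captions)     # stage 2: e.g. CaptionDecoder above

class UnifiedCNNRNN(nn.Module):
    """Unified combination (assumed reading): the image feature is fed
    into the recurrence at every step, so CNN and RNN act as one unit."""

    def __init__(self, cnn, embed, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.cnn = cnn
        self.embed = embed
        self.rnn = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, captions):
        feat = self.cnn(image)                                   # reused at each step
        h = feat.new_zeros(feat.size(0), self.rnn.hidden_size)
        logits = []
        for t in range(captions.size(1)):
            x = torch.cat([self.embed(captions[:, t]), feat], dim=-1)
            h = self.rnn(x, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)
```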
“…The cascaded CNN-RNN frameworks are often intended for different tasks, rather than image classification. For example, [8,45,52] employed CNN-RNN to address the image captioning task, and [50] utilized CNN-RNN to rank the tag list based on the visual importance.…”
Section: Usage of CNN-RNN Framework (mentioning)
confidence: 99%