Proceedings of the 24th ACM International Conference on Multimedia 2016
DOI: 10.1145/2964284.2964299

Image Captioning with Deep Bidirectional LSTMs

Abstract: This work presents an end-to-end trainable deep bidirectional LSTM (Long Short-Term Memory) model for image captioning. Our model builds on a deep convolutional neural network (CNN) and two separate LSTM networks. It is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. Two novel deep bidirectional variant models, in which we increase the depth of the nonlinearity transition in different ways, are proposed to learn hierarc…
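The core idea in the abstract (a forward and a backward LSTM reading the word sequence in opposite directions, with their hidden states combined per position) can be sketched minimally. This is not the paper's architecture: the dimensions, random parameters, and function names below are illustrative, and the CNN image feature the paper feeds into the LSTMs is omitted for brevity.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gates stacked in z as [input, forget, output, candidate]."""
    z = W @ x + U @ h + b
    n = h.size
    i, f, o = (1 / (1 + np.exp(-z[k * n:(k + 1) * n])) for k in range(3))
    g = np.tanh(z[3 * n:4 * n])
    c = f * c + i * g
    return o * np.tanh(c), c

def bidirectional_states(xs, params_f, params_b, hidden=8):
    """Run a forward and a backward LSTM over the embedded word sequence xs
    and concatenate their hidden states at each position."""
    def run(seq, params):
        h, c, out = np.zeros(hidden), np.zeros(hidden), []
        for x in seq:
            h, c = lstm_step(x, h, c, *params)
            out.append(h)
        return out
    fwd = run(xs, params_f)
    bwd = run(xs[::-1], params_b)[::-1]  # backward pass, re-aligned to positions
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]

rng = np.random.default_rng(0)
emb, hid, T = 6, 8, 5  # toy sizes: embedding dim, hidden dim, sequence length
def make_params():
    return (0.1 * rng.standard_normal((4 * hid, emb)),
            0.1 * rng.standard_normal((4 * hid, hid)),
            np.zeros(4 * hid))
xs = [rng.standard_normal(emb) for _ in range(T)]
states = bidirectional_states(xs, make_params(), make_params(), hidden=hid)
```

Each position thus sees both history (forward state) and future context (backward state), which is the "history and future context information" the abstract refers to.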

Cited by 222 publications (115 citation statements)
References 33 publications
“…Datasets and evaluation metrics per model:
Kiros et al. 2014 [69]: IAPR TC-12, SBU (BLEU, PPLX)
Kiros et al. 2014 [70]: Flickr [90] Flickr 8k, UIUC (BLEU, R@K)
You et al. 2016 [156]: Flickr 30K, MS COCO (BLEU, METEOR, ROUGE, CIDEr)
Yang et al. 2016 [153]: Visual Genome (METEOR, AP, IoU)
Anne et al. 2016 [6]: MS COCO, ImageNet (BLEU, METEOR)
Yao et al. 2017 [155]: MS COCO (BLEU, METEOR, ROUGE, CIDEr)
Lu et al. 2017 [88]: Flickr 30K, MS COCO (BLEU, METEOR, CIDEr)
Chen et al. 2017 [21]: Flickr 8K/30K, MS COCO (BLEU, METEOR, ROUGE, CIDEr)
Gan et al. 2017 [41]: Flickr [85] MS COCO (SPIDEr, Human Evaluation)
Gu et al. 2017 [51]: Flickr 30K, MS COCO (BLEU, METEOR, CIDEr, SPICE)
Yao et al. 2017 [154]: MS COCO, ImageNet (METEOR)
Rennie et al. 2017 [120]: MS COCO (BLEU, METEOR, CIDEr, ROUGE)
Vsub et al. 2017 [140]: MS COCO, ImageNet (METEOR)
Zhang et al. 2017 [161]: MS COCO (BLEU, METEOR, ROUGE, CIDEr)
Wu et al. 2018 [150]: Flickr 8K/30K, MS COCO (BLEU, METEOR, CIDEr)
Aneja et al. 2018 [5]: MS COCO (BLEU, METEOR, ROUGE, CIDEr)
Wang et al. 2018 [147]: MS COCO (BLEU, METEOR, ROUGE, CIDEr)
[21,59,61,144,150,152] have performed experiments using the dataset. Two sample results by Jia et al. [59] on this dataset are shown in Figure 13. 4.1.4 Visual Genome Dataset.…”
Section: Reference (mentioning)
confidence: 99%
“…Seq2Seq-f+b: it fills the blanks with both Seq2Seq-f and Seq2Seq-b, and then selects the output with the higher probability assigned by the two seq2seq models. This method is used in Wang et al. (2016). (Berglund et al., 2015) on a well-trained seq2seq model with a BiRNN as the decoder to fill the blanks.…”
Section: Baselines (mentioning)
confidence: 99%
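The Seq2Seq-f+b selection rule quoted above (each direction proposes its best filler, and the proposal with the higher model probability wins) can be sketched as follows. The candidate words and log-probability tables are made up for the example; `fill_blank_f_plus_b` is a hypothetical name, not an API from the cited work.

```python
import math

def fill_blank_f_plus_b(candidates, logp_forward, logp_backward):
    """Seq2Seq-f+b baseline: the forward (left-to-right) and backward
    (right-to-left) models each pick their best candidate filler, then
    the one assigned the higher probability by its own model is kept."""
    best_f = max(candidates, key=logp_forward)
    best_b = max(candidates, key=logp_backward)
    return best_f if logp_forward(best_f) >= logp_backward(best_b) else best_b

# Toy log-probabilities: the backward model is more confident here.
cands = ["cat", "dog", "car"]
lp_f = {"cat": math.log(0.5), "dog": math.log(0.3), "car": math.log(0.2)}.get
lp_b = {"cat": math.log(0.2), "dog": math.log(0.6), "car": math.log(0.2)}.get
choice = fill_blank_f_plus_b(cands, lp_f, lp_b)
```

Note that this is a post-hoc arbitration between two independently trained models, which is exactly the limitation the later citation statements criticize.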
“…Most existing models generate the word sequence one word at a time in a front-to-back manner, without considering the influence of the subsequent words on the whole sentence. Bidirectional LSTMs have been developed to generate sentences from two directions independently [38,39]. Essentially, this works the same way as before, since the forward and backward LSTMs are still trained without interaction.…”
Section: Phased Trainable Models (mentioning)
confidence: 99%
“…The encoder-decoder models usually use forward LSTMs to generate the words of a sentence from beginning to end [1,2,5]. Recently, bidirectional LSTMs have been developed to generate sentences from two directions independently, i.e., a forward LSTM and a backward LSTM are trained without interaction [38,39]. However, three problems remain unsolved.…”
Section: Introduction (mentioning)
confidence: 99%
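The "two directions without interaction" point made in the last two citation statements can be illustrated with a toy decoder: one model extends the sentence left to right, another right to left, and the two never exchange information while decoding, so they can produce different captions. The bigram tables below are invented for the example; real models would score and rank the two outputs afterwards.

```python
# Toy deterministic "models": each maps the current word to the next one.
FWD = {"<s>": "a", "a": "dog", "dog": "runs", "runs": "</s>"}   # left to right
BWD = {"</s>": "fast", "fast": "runs", "runs": "dog", "dog": "<s>"}  # right to left

def decode(table, start, stop):
    """Greedily follow the table from the start token until the stop token."""
    out, w = [], start
    while True:
        w = table[w]
        if w == stop:
            return out
        out.append(w)

forward_caption = decode(FWD, "<s>", "</s>")
backward_caption = decode(BWD, "</s>", "<s>")[::-1]  # reverse into reading order
```

Since nothing constrains the two decoders to agree, `forward_caption` and `backward_caption` differ here, which is the limitation these works set out to address.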