2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.138

An Empirical Study of Language CNN for Image Captioning

Abstract: Language models based on recurrent neural networks have dominated recent image caption generation tasks. In this paper, we introduce a language CNN model which is suitable for statistical language modeling tasks and shows competitive performance in image captioning. In contrast to previous models, which predict the next word based on one previous word and a hidden state, our language CNN is fed with all the previous words and can model the long-range dependencies in history words, which are critical for image caption…
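
As a rough illustration of the idea (not the authors' exact architecture, which also involves recurrent components), a language CNN can be sketched as a stack of causal 1-D convolutions over the full word history, conditioned on a global image feature; the 2048-d image dimension and layer sizes below are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageCNN(nn.Module):
    """Toy language-CNN decoder: predicts the next word from ALL previous
    words plus a global image feature (contrast with an RNN step, which
    consumes only the last word and a hidden state)."""
    def __init__(self, vocab_size, embed_dim=256, img_dim=2048,
                 kernel_size=3, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)   # inject image evidence
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, embed_dim, kernel_size)
            for _ in range(num_layers))
        self.kernel_size = kernel_size
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, words, img_feat):
        # words: (batch, t) ids of the word history; img_feat: (batch, img_dim)
        x = self.embed(words) + self.img_proj(img_feat).unsqueeze(1)
        x = x.transpose(1, 2)                     # (batch, embed_dim, t)
        for conv in self.convs:
            # left-only padding keeps the convolution causal: position i
            # sees words <= i, never future words
            x = torch.relu(conv(F.pad(x, (self.kernel_size - 1, 0))))
        return self.out(x[:, :, -1])              # logits for the next word

# e.g. LanguageCNN(10000)(torch.randint(0, 10000, (2, 7)), torch.randn(2, 2048))
```

Because every layer pads only on the left, the prediction at step t is a function of the entire history up to t, and stacking layers widens how far back the model can look.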

Cited by 128 publications (66 citation statements)
References 40 publications
“…In this experiment, we first train the network with Eq. (15), and then fine-tune it with Eq. (16) on the sentence corpus.…”
Section: Quantitative Results
confidence: 99%
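
The excerpt does not reproduce Eq. (15) or Eq. (16), so the schedule can only be sketched generically; `pretrain_loss` and `finetune_loss` below are hypothetical placeholders for those two objectives, not the paper's actual losses:

```python
import torch

def train_two_stage(model, caption_loader, sentence_corpus_loader,
                    pretrain_loss, finetune_loss,
                    pretrain_epochs=20, finetune_epochs=5):
    # stage 1: optimise the Eq.(15)-style objective on image-caption pairs
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(pretrain_epochs):
        for batch in caption_loader:
            opt.zero_grad()
            pretrain_loss(model, batch).backward()
            opt.step()
    # stage 2: fine-tune with the Eq.(16)-style objective on the sentence corpus
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)  # smaller step size
    for _ in range(finetune_epochs):
        for batch in sentence_corpus_loader:
            opt.zero_grad()
            finetune_loss(model, batch).backward()
            opt.step()
```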
“…Therefore, convolutional architectures have recently been used in other sequence-to-sequence tasks, e.g., conditional image generation [137] and machine translation [42,43,138]. Inspired by this success of CNNs in sequence learning tasks, Gu et al. [51] proposed a CNN language-model-based image captioning method, which uses a language-CNN for statistical language modelling.…”
Section: LSTM vs. Others
confidence: 99%
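
One way to see why a convolutional language model can cover long histories: the receptive field of a stack of causal 1-D convolutions grows with depth, and faster still with dilation. A small back-of-the-envelope check, not taken from the paper:

```python
def receptive_field(kernel_size, dilations):
    """Tokens of history visible to the top layer of a causal conv stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(receptive_field(3, [1, 1, 1, 1]))  # 4 plain layers  -> 9 tokens
print(receptive_field(3, [1, 2, 4, 8]))  # dilated stack   -> 31 tokens
```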
“…Work | Datasets | Evaluation Metrics
Kiros et al. 2014 [69] | IAPR TC-12, SBU | BLEU, PPLX
Kiros et al. 2014 [70] | Flickr … | …
… [90] | Flickr 8k, UIUC | BLEU, R@K
You et al. 2016 [156] | Flickr 30K, MS COCO | BLEU, METEOR, ROUGE, CIDEr
Yang et al. 2016 [153] | Visual Genome | METEOR, AP, IoU
Anne et al. 2016 [6] | MS COCO, ImageNet | BLEU, METEOR
Yao et al. 2017 [155] | MS COCO | BLEU, METEOR, ROUGE, CIDEr
Lu et al. 2017 [88] | Flickr 30K, MS COCO | BLEU, METEOR, CIDEr
Chen et al. 2017 [21] | Flickr 8K/30K, MS COCO | BLEU, METEOR, ROUGE, CIDEr
Gan et al. 2017 [41] | Flickr … | …
… [85] | MS COCO | SPIDEr, Human Evaluation
Gu et al. 2017 [51] | Flickr 30K, MS COCO | BLEU, METEOR, CIDEr, SPICE
Yao et al. 2017 [154] | MS COCO, ImageNet | METEOR
Rennie et al. 2017 [120] | MS COCO | BLEU, METEOR, CIDEr, ROUGE
Vsub et al. 2017 [140] | MS COCO, ImageNet | METEOR
Zhang et al. 2017 [161] | MS COCO | BLEU, METEOR, ROUGE, CIDEr
Wu et al. 2018 [150] | Flickr 8K/30K, MS COCO | BLEU, METEOR, CIDEr
Aneja et al. 2018 [5] | MS COCO | BLEU, METEOR, ROUGE, CIDEr
Wang et al. 2018 [147] | MS COCO | BLEU, METEOR, ROUGE, CIDEr

[21,59,61,144,150,152] have performed experiments using the dataset. Two sample results by Jia et al. [59] on this dataset are shown in Figure 13.…”
Section: Reference
confidence: 99%
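
Of the metrics recurring in the table above, sentence-level BLEU is the easiest to demonstrate in a few lines; the snippet below uses NLTK with made-up tokenized captions, whereas in practice most of these papers report corpus-level scores from the coco-caption toolkit:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# two reference captions and one generated candidate, all pre-tokenized
references = [["a", "man", "rides", "a", "horse", "on", "the", "beach"],
              ["a", "person", "riding", "a", "horse", "near", "the", "ocean"]]
candidate = ["a", "man", "riding", "a", "horse", "on", "the", "beach"]

score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),  # BLEU-4
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```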
“…The goal of the image encoder is to obtain a feature representation of an image. In this paper, we adopt an Image-CNN to extract visual features, similar to other image captioning models [43,39,13,3]. The main difference is that our GHA is able to use convolutional feature maps at different levels, as shown in Figure 1.…”
Section: Gated Hierarchical Attention
confidence: 99%
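
Tapping feature maps at different levels of an Image-CNN is commonly done with forward hooks; a minimal sketch with a torchvision ResNet-50 (the cited work's actual backbone and chosen layers may differ):

```python
import torch
from torchvision.models import resnet50

cnn = resnet50(weights=None).eval()   # in practice, load ImageNet weights
feats = {}

def save_to(name):
    def hook(module, inputs, output):
        feats[name] = output          # keep this level's feature map
    return hook

# tap three levels of the backbone
for name in ("layer2", "layer3", "layer4"):
    getattr(cnn, name).register_forward_hook(save_to(name))

with torch.no_grad():
    cnn(torch.randn(1, 3, 224, 224))  # dummy image

for name, f in feats.items():
    print(name, tuple(f.shape))       # layer2 (1,512,28,28) ... layer4 (1,2048,7,7)
```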