Improving Image Captioning with Better Use of Caption

Shi, Zhan; Zhou, Xu; Qiu, Xipeng; Zhu, Xiaodan

doi:10.18653/v1/2020.acl-main.664

Cited by 52 publications

(24 citation statements)

References 33 publications

“…We compare the proposed MGAN, where MGAs is applied in the encoder and AWG(mean) is adopted in the decoder as shown in Figure 4, with the existing state-of-the-art models on the widely used Karpathy's test split. The models we compared include: NIC [13], SCST [23], LSTM-A [21], Up-Down [3], RFNet [43], GCN-LSTM [14], ETA [9], AoANet [4], Sub-GC [45], MT [44] and NG-SAN [11]. In all the above models, except NIC and LSTM-A, the other models all employ attention mechanism.…”

Section: Performance Comparison (mentioning)

confidence: 99%

Multi-Gate Attention Network for Image Captioning

Jiang et al. 2021, IEEE Access

Self-attention mechanism, which has been successfully applied to current encoder-decoder framework of image captioning, is used to enhance the feature representation in the image encoder and capture the most relevant information for the language decoder. However, most existing methods will assign attention weights to all candidate vectors, which implicitly hypothesizes that all vectors are relevant. Moreover, current self-attention mechanisms ignore the intra-object attention distribution, and only consider the inter-object relationships. In this paper, we propose a Multi-Gate Attention (MGA) block, which expands the traditional self-attention by equipping with additional Attention Weight Gate (AWG) module and Self-Gated (SG) module. The former constrains the attention weights to be assigned to the most contributive objects. The latter is adopted to consider the intra-object attention distribution and eliminate the irrelevant information in object feature vector. Furthermore, most current image captioning methods apply the original transformer designed for natural language processing task, to refine image features directly. Therefore, we propose a pre-layernorm transformer to simplify the transformer architecture and make it more efficient for image feature enhancement. By integrating MGA block with pre-layernorm transformer architecture into the image encoder and AWG module into the language decoder, we present a novel Multi-Gate Attention Network (MGAN). The experiments on MS COCO dataset indicate that the MGAN outperforms most of the state-of-the-art, and further experiments on other methods combined with MGA blocks demonstrate the generalizability of our proposal. INDEX TERMS Image captioning, self-attention, transformer, multi-gate attention.

“…It often leverages a CNN or variants as the image encoder and an RNN as the decoder to generate sentences (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015; Donahue et al., 2015; Yang et al., 2016). To improve the performance on reference-based automatic evaluation metrics, previous work has used visual attention mechanism (Anderson et al., 2018; Lu et al., 2017; Pedersoli et al., 2017; Xu et al., 2015; Pan et al., 2020), explicit high-level attributes detection (Yao et al., 2017; You et al., 2016), reinforcement learning methods (Rennie et al., 2017; Ranzato et al., 2015; Liu et al., 2018a), contrastive or adversarial learning, multistep decoding (Liu et al., 2019a; Gu et al., 2018), weighted training by word-image correlation (Ding et al., 2019) and scene graph detection (Yao et al., 2018; Yang et al., 2019; Shi et al., 2020).…”

Section: Related Work (mentioning)

confidence: 99%

Enhancing Descriptive Image Captioning with Natural Language Inference

Shi, Liu, Zhu 2021, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

Self Cite

Generating descriptive sentences that convey non-trivial, detailed, and salient information about images is an important goal of image captioning. In this paper we propose a novel approach to encourage captioning models to produce more detailed captions using natural language inference, based on the motivation that, among different captions of an image, descriptive captions are more likely to entail less descriptive captions. Specifically, we construct directed inference graphs for reference captions based on natural language inference. A PageRank algorithm is then employed to estimate the descriptiveness score of each node. Built on that, we use reference sampling and weighted designated rewards to guide captioning to generate descriptive captions. The results on MSCOCO show that the proposed method outperforms the baselines significantly on a wide range of conventional and descriptiveness-related evaluation metrics.
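
The scoring step described in this abstract (an inference graph over reference captions, then PageRank per node) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy captions, the `entails` pairs standing in for NLI model output, and the edge direction (from the entailed caption to the entailing one, so that rank accumulates on more descriptive captions) are all assumptions made for illustration.

```python
def pagerank(nodes, edges, damping=0.85, iters=100, tol=1e-10):
    """Plain power-iteration PageRank on a small directed graph."""
    n = len(nodes)
    out = {u: [] for u in nodes}
    for u, v in edges:
        out[u].append(v)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    new[v] += share
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical reference captions for one image.
captions = {
    "c0": "a dog",
    "c1": "a dog on grass",
    "c2": "a brown dog running on grass",
}

# Hypothetical NLI judgments: (premise, hypothesis) pairs labeled "entailment".
entails = [("c2", "c1"), ("c2", "c0"), ("c1", "c0")]

# Assumed edge direction: entailed caption -> entailing caption.
edges = [(hypothesis, premise) for premise, hypothesis in entails]

scores = pagerank(list(captions), edges)
ranking = sorted(captions, key=scores.get, reverse=True)
for cid in ranking:
    print(f"{scores[cid]:.3f}  {captions[cid]}")
```

Under these assumptions the most detailed caption receives the highest score, since every less descriptive caption it entails passes rank to it.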

“…Image captioning, aiming at generating visually grounded descriptions for images, often leverages a CNN or variants as the image encoder and an RNN as the decoder to generate sentences [2,10,11]. To improve the performance on reference-based automatic evaluation metrics, visual attention mechanism [1,3,4], explicit high-level attributes detection [12,13], reinforcement learning methods [14], contrastive or adversarial learning [15,16], multi-step decoding [17] and scene graph detection [18,19] have been proposed. The work of [20,21] is most related to ours, which uses retrieval loss as a reward signal to produce descriptive captions.…”

Section: Related Work (mentioning)

confidence: 99%

Descriptive Image Captioning with Salient Retrieval Priors

Shi, Liu, Zhu 2021, Proceedings of the Canadian Conference on Artificial Intelligence

Self Cite

Captions are often expected to carry detailed, essential information about images, but current image captioning models tend to play it safe and generate generic captions that are less informative. Cross-modal retrieval is a promising solution, as texts with more details perform better in retrieval. In this work, we first explore two types of salient n-grams, i.e., Support N-grams (SN) and Deletion N-grams (DN), in captions which significantly affect the performance of typical cross-modal retrieval models. We further exploit these n-grams to enhance the original learning objectives for generating descriptive captions with more details. The experiments on two benchmark datasets show that our proposed model outperforms baselines significantly when evaluated with a wide range of metrics.
