Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.664

Improving Image Captioning with Better Use of Caption

Abstract: Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision communities. In this paper, we present a novel image captioning architecture to better explore semantics available in captions and leverage them to enhance both image representation and caption generation. Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. The representati…
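The weak-supervision idea in the abstract can be made concrete with a small sketch. Below is a minimal noisy-OR multi-instance learning objective for aligning caption words to image regions, the kind of signal a caption-guided relationship graph could be built on; the class name `RegionWordMIL`, the feature dimensions, and the vocabulary size are illustrative assumptions, not the paper's actual architecture.

```python
# Hedged sketch: weakly supervised multi-instance learning (MIL) for
# aligning caption words to image regions. The paper's actual
# graph-construction procedure may differ; everything below is a
# simplified placeholder.
import torch
import torch.nn as nn

class RegionWordMIL(nn.Module):
    def __init__(self, region_dim: int = 2048, vocab_size: int = 10000):
        super().__init__()
        # Per-region logits over the word vocabulary.
        self.scorer = nn.Linear(region_dim, vocab_size)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, n_regions, region_dim)
        # p[b, r, w] = probability that region r of image b evokes word w
        p = torch.sigmoid(self.scorer(region_feats))
        # Noisy-OR over regions: the image "mentions" word w if at
        # least one region does.
        p_image = 1.0 - torch.prod(1.0 - p, dim=1)  # (batch, vocab_size)
        return p_image

# Weak supervision: the only labels are which words occur in the caption.
model = RegionWordMIL()
regions = torch.randn(2, 36, 2048)      # e.g. 36 detector regions per image
targets = torch.zeros(2, 10000)
targets[0, [17, 42]] = 1.0              # words present in caption 0
loss = nn.functional.binary_cross_entropy(model(regions), targets)
```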

Cited by 52 publications (24 citation statements) · References 33 publications
“…We compare the proposed MGAN, where MGAs are applied in the encoder and AWG(mean) is adopted in the decoder as shown in Figure 4, with existing state-of-the-art models on the widely used Karpathy test split. The models compared include NIC [13], SCST [23], LSTM-A [21], Up-Down [3], RFNet [43], GCN-LSTM [14], ETA [9], AoANet [4], Sub-GC [45], MT [44] and NG-SAN [11]. All of these models except NIC and LSTM-A employ an attention mechanism.…”
Section: Performance Comparison
Confidence: 99%
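Since the passage above singles out attention as the ingredient shared by nearly all the compared models, here is a minimal sketch of the additive soft visual attention they have in common (in the style of Xu et al., 2015); the layer names and sizes are illustrative and are not drawn from MGAN or any specific compared model.

```python
# Hedged sketch of additive (soft) visual attention over region features.
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats: torch.Tensor, hidden: torch.Tensor):
        # feats: (batch, n_regions, feat_dim); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # (batch, n_regions, 1)
        context = (alpha * feats).sum(dim=1)     # attended image feature
        return context, alpha.squeeze(-1)
```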
“…It often leverages a CNN or variants as the image encoder and an RNN as the decoder to generate sentences (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015; Donahue et al., 2015; Yang et al., 2016). To improve performance on reference-based automatic evaluation metrics, previous work has used visual attention mechanisms (Anderson et al., 2018; Lu et al., 2017; Pedersoli et al., 2017; Xu et al., 2015; Pan et al., 2020), explicit high-level attribute detection (Yao et al., 2017; You et al., 2016), reinforcement learning methods (Rennie et al., 2017; Ranzato et al., 2015; Liu et al., 2018a), contrastive or adversarial learning, multi-step decoding (Liu et al., 2019a; Gu et al., 2018), weighted training by word-image correlation (Ding et al., 2019) and scene graph detection (Yao et al., 2018; Yang et al., 2019; Shi et al., 2020).…”
Section: Related Work
Confidence: 99%
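The CNN-encoder/RNN-decoder pipeline described above fits in a few lines. This is a hedged, NIC-style sketch: the single pooled image feature, the layer sizes, and the class name are simplifications for illustration, not any cited system's exact design.

```python
# Hedged sketch of the canonical CNN-encoder / RNN-decoder captioner
# (NIC-style, Vinyals et al., 2015): the image feature is fed to the
# LSTM as a pseudo first token, then the model predicts the next word.
import torch
import torch.nn as nn

class SimpleCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # CNN feature -> LSTM input
        self.embed = nn.Embedding(vocab, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab)

    def forward(self, img_feat, captions):
        # img_feat: (batch, feat_dim) pooled CNN feature
        # captions: (batch, seq_len) ground-truth token ids (teacher forcing)
        img = self.img_proj(img_feat).unsqueeze(1)      # pseudo first token
        words = self.embed(captions)
        inputs = torch.cat([img, words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                         # next-word logits
```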
“…Image captioning, which aims to generate visually grounded descriptions for images, often leverages a CNN or variants as the image encoder and an RNN as the decoder to generate sentences [2,10,11]. To improve performance on reference-based automatic evaluation metrics, visual attention mechanisms [1,3,4], explicit high-level attribute detection [12,13], reinforcement learning methods [14], contrastive or adversarial learning [15,16], multi-step decoding [17] and scene graph detection [18,19] have been proposed. The works of [20,21] are the most related to ours; they use a retrieval loss as a reward signal to produce descriptive captions.…”
Section: Related Work
Confidence: 99%
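As a concrete instance of the reinforcement-learning line of work cited in both passages above, below is a minimal sketch of the self-critical sequence training (SCST) loss of Rennie et al. (2017), where the greedy caption's reward serves as the baseline for a sampled caption. The reward itself (CIDEr, or a retrieval score as in [20,21]) is abstracted away here, and the numbers in the usage snippet are made up.

```python
# Hedged sketch of the SCST (self-critical) policy-gradient loss.
import torch

def scst_loss(sample_logprobs: torch.Tensor,
              sample_reward: torch.Tensor,
              greedy_reward: torch.Tensor) -> torch.Tensor:
    # sample_logprobs: (batch,) summed log-probs of each sampled caption
    # rewards: (batch,) e.g. CIDEr of the sampled vs. greedy decode
    advantage = sample_reward - greedy_reward        # baseline-subtracted
    # REINFORCE: raise the log-prob of samples that beat the greedy baseline.
    return -(advantage.detach() * sample_logprobs).mean()

# Toy usage with made-up numbers:
loss = scst_loss(torch.tensor([-12.3, -9.8], requires_grad=True),
                 torch.tensor([1.10, 0.85]),
                 torch.tensor([0.95, 0.90]))
loss.backward()
```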