“…It often leverages a CNN or one of its variants as the image encoder and an RNN as the decoder to generate sentences (Vinyals et al, 2015; Karpathy and Fei-Fei, 2015; Donahue et al, 2015; Yang et al, 2016). To improve performance on reference-based automatic evaluation metrics, previous work has used visual attention mechanisms (Anderson et al, 2018; Lu et al, 2017; Pedersoli et al, 2017; Xu et al, 2015; Pan et al, 2020), explicit high-level attribute detection (Yao et al, 2017; You et al, 2016), reinforcement learning methods (Rennie et al, 2017; Ranzato et al, 2015; Liu et al, 2018a), contrastive or adversarial learning, multistep decoding (Liu et al, 2019a; Gu et al, 2018), weighted training by word-image correlation (Ding et al, 2019), and scene graph detection (Yao et al, 2018; Yang et al, 2019; Shi et al, 2020).…”