2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00184
Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style

Abstract: Image captioning is a research hotspot where encoder-decoder models combining a convolutional neural network (CNN) and long short-term memory (LSTM) achieve promising results. Despite significant progress, these models generate sentences differently from human cognitive styles. Existing models often generate a complete sentence from the first word to the end, without considering the influence of the following words on the whole sentence generation. In this paper, we explore the utilization of a human-like cogniti…
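To make the abstract's pipeline concrete, the following is a minimal sketch of a CNN + LSTM encoder-decoder captioner of the kind described above; the ResNet-50 backbone, layer sizes, and the way the image embedding is prepended to the word sequence are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal CNN + LSTM encoder-decoder captioner (illustrative sketch only;
# backbone choice, layer sizes, and input handling are assumptions, not the
# architecture from the paper).
import torch
import torch.nn as nn
import torchvision.models as models


class SimpleCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: a CNN whose classifier head is replaced by a linear
        # projection into the word-embedding space (pretrained in practice).
        cnn = models.resnet50()
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # global pooled features
        self.proj = nn.Linear(cnn.fc.in_features, embed_dim)
        # Decoder: a single-layer LSTM language model over the vocabulary.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)          # (B, 2048)
        img_emb = self.proj(feats).unsqueeze(1)          # (B, 1, E)
        word_emb = self.embed(captions)                  # (B, T, E)
        # Prepend the image embedding as the first "token" of the sequence,
        # then predict the caption word by word, left to right.
        inputs = torch.cat([img_emb, word_emb], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                          # (B, T+1, V) word logits
```

This left-to-right decoding is exactly the generation order the paper questions: each word is predicted only from the words before it.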

Cited by 17 publications (8 citation statements) · References 39 publications
“…Other approaches. The solution based on additive attention over a grid of features has been widely adopted by several following works with minor improvements in terms of visual encoding [29], [32], [34], [35], [36], [37].…”
Section: Attention Over Grid of CNN Features
Mentioning confidence: 99%
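As background for this citation context, here is a minimal sketch of additive (Bahdanau-style) attention over a flattened grid of CNN features; the dimensions and module names are assumptions for illustration, not taken from any specific cited model.

```python
# Additive attention over a grid of CNN features (illustrative sketch;
# feature and hidden sizes are assumptions, not from a particular model).
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=512):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)      # scores the grid features
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)  # scores the decoder state
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, grid_feats, hidden):
        # grid_feats: (B, L, feat_dim) -- e.g. a 7x7 grid flattened to L = 49 regions
        # hidden:     (B, hidden_dim)  -- current LSTM hidden state
        scores = self.v(torch.tanh(
            self.w_feat(grid_feats) + self.w_hidden(hidden).unsqueeze(1)
        )).squeeze(-1)                                    # (B, L) region scores
        alpha = torch.softmax(scores, dim=-1)             # attention weights over regions
        context = (alpha.unsqueeze(-1) * grid_feats).sum(dim=1)  # (B, feat_dim)
        return context, alpha


# Example: attend over a 7x7 feature map at one decoding step.
attn = AdditiveAttention(feat_dim=2048, hidden_dim=512)
context, alpha = attn(torch.randn(4, 49, 2048), torch.randn(4, 512))
```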
“…Hidden state reconstruction - Chen et al. [34] proposed to regularize the transition dynamics of the language model by using a second LSTM for reconstructing the previous hidden state based on the current one. Ge et al. [36] proposed to better capture context information by using a bidirectional LSTM with an auxiliary module. The auxiliary module in one direction approximates the hidden state of the LSTM in the other direction.…”
Section: Single-Layer LSTM
Mentioning confidence: 99%
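The hidden-state reconstruction idea quoted above can be illustrated with a small auxiliary module that tries to recover the previous decoder state from the current one and is penalized for the mismatch; the LSTM-cell reconstructor and mean-squared-error loss below are assumptions used only to sketch the general mechanism, not the exact formulation of Chen et al. [34] or Ge et al. [36].

```python
# Hidden-state reconstruction regularizer (illustrative sketch of the general
# idea; module choice and loss are assumptions, not the cited formulations).
import torch
import torch.nn as nn


class HiddenStateReconstructor(nn.Module):
    """Predicts the previous decoder hidden state from the current one."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.recon = nn.LSTMCell(hidden_dim, hidden_dim)

    def forward(self, hidden_states):
        # hidden_states: (B, T, H), decoder hidden states h_1..h_T
        B, T, H = hidden_states.shape
        h = hidden_states.new_zeros(B, H)
        c = hidden_states.new_zeros(B, H)
        loss = hidden_states.new_zeros(())
        for t in range(1, T):
            # Reconstruct h_{t-1} from h_t and penalize the mismatch, which
            # regularizes the transition dynamics of the language model.
            h, c = self.recon(hidden_states[:, t], (h, c))
            loss = loss + torch.mean((h - hidden_states[:, t - 1]) ** 2)
        return loss / max(T - 1, 1)


# Example: add the reconstruction term to the captioning objective,
# e.g. total = cross_entropy + 0.1 * aux_loss (weight is an assumption).
recon = HiddenStateReconstructor(hidden_dim=512)
aux_loss = recon(torch.randn(4, 12, 512))
```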
“…To enhance diversity, GAN-based methods (Dognin et al. 2019; Dai et al. 2017; Chen et al. 2018) are introduced in image captioning. Models proposed in (Zheng, Li, and Wang 2019; Ge et al. 2019) change the order of sentence generation, starting from the middle or the end of sentences. In (Yang et al. 2018; Chen et al. 2020; Shi et al. 2020), scene graphs are employed to further explore the objects, attributes, and relationships in the image, which improves the overall performance of captioning models.…”
Section: Image Captioning
Mentioning confidence: 99%
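The idea of changing the generation order, as in Zheng, Li, and Wang (2019) and Ge et al. (2019), can be sketched with two decoders that grow a sentence outward from a chosen middle word; the greedy decoding and module layout below are purely illustrative assumptions and do not reproduce either cited method.

```python
# Sketch of non-left-to-right caption generation: start from a given middle
# word and grow the sentence in both directions with two decoders.
# (Illustration of the general idea only; the cited methods differ in detail.)
import torch
import torch.nn as nn


class TwoWayDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fwd = nn.LSTMCell(embed_dim, hidden_dim)   # generates words to the right
        self.bwd = nn.LSTMCell(embed_dim, hidden_dim)   # generates words to the left
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def generate(self, start_word, steps=5):
        right, left = [], []
        for cell, side in ((self.fwd, right), (self.bwd, left)):
            h = torch.zeros(1, self.out.in_features)
            c = torch.zeros(1, self.out.in_features)
            word = torch.tensor([start_word])
            for _ in range(steps):
                h, c = cell(self.embed(word), (h, c))
                word = self.out(h).argmax(dim=-1)       # greedy next/previous word id
                side.append(word.item())
        # Left-side tokens were produced inner-to-outer, so reverse them.
        return list(reversed(left)) + [start_word] + right


decoder = TwoWayDecoder(vocab_size=1000)
print(decoder.generate(start_word=42))   # word ids growing outward from word 42
```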
“…Generally, existing WREG methods consist of two steps, namely sentence-level matching and reconstruction [7][8][9][10]. In the first step, WREG methods roughly adopt the sentence-level matching procedures from existing fully-supervised REG methods [5] in order to calculate the similarity between the entire query and each candidate proposal.…”
Section: Introduction
Mentioning confidence: 99%
“…1(a). Its accuracy, however, proves hardly satisfactory even in a fully-supervised setting [9,10] and makes the BP loss unreliable. Additionally, there is an architectural imbalance from the heavy RNN-style reconstruction network, which is never used in the final inference stage while occupying a large proportion of the parameters of the entire network (around 75% in [7,8]).…”
Section: Introduction
Mentioning confidence: 99%