2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00184
Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style

Abstract: Image captioning is a research hotspot where encoder-decoder models combining a convolutional neural network (CNN) and long short-term memory (LSTM) achieve promising results. Despite significant progress, these models generate sentences differently from human cognitive styles. Existing models often generate a complete sentence from the first word to the end, without considering the influence of the following words on the whole sentence generation. In this paper, we explore the utilization of a human-like cogniti…
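To make the abstract's pipeline concrete, the following is a minimal sketch of a CNN + LSTM encoder-decoder captioner of the kind described above; the ResNet-50 backbone, layer sizes, and the way the image embedding is prepended to the word sequence are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal CNN + LSTM encoder-decoder captioner (illustrative sketch only;
# backbone choice, layer sizes, and input handling are assumptions, not the
# architecture from the paper).
import torch
import torch.nn as nn
import torchvision.models as models


class SimpleCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: a CNN whose classifier head is replaced by a linear
        # projection into the word-embedding space (pretrained in practice).
        cnn = models.resnet50()
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # global pooled features
        self.proj = nn.Linear(cnn.fc.in_features, embed_dim)
        # Decoder: a single-layer LSTM language model over the vocabulary.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)          # (B, 2048)
        img_emb = self.proj(feats).unsqueeze(1)          # (B, 1, E)
        word_emb = self.embed(captions)                  # (B, T, E)
        # Prepend the image embedding as the first "token" of the sequence,
        # then predict the caption word by word, left to right.
        inputs = torch.cat([img_emb, word_emb], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                          # (B, T+1, V) word logits
```

This left-to-right decoding is exactly the generation order the paper questions: each word is predicted only from the words before it.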

Cited by 17 publications (8 citation statements) · References 39 publications
“…Other approaches. The solution based on additive attention over a grid of features has been widely adopted by several following works with minor improvements in terms of visual encoding [29], [32], [34], [35], [36], [37].…”
Section: Attention Over Grid of CNN Features
Mentioning confidence: 99%
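As background for this citation context, here is a minimal sketch of additive (Bahdanau-style) attention over a flattened grid of CNN features; the dimensions and module names are assumptions for illustration, not taken from any specific cited model.

```python
# Additive attention over a grid of CNN features (illustrative sketch;
# feature and hidden sizes are assumptions, not from a particular model).
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=512):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)      # scores the grid features
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)  # scores the decoder state
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, grid_feats, hidden):
        # grid_feats: (B, L, feat_dim) -- e.g. a 7x7 grid flattened to L = 49 regions
        # hidden:     (B, hidden_dim)  -- current LSTM hidden state
        scores = self.v(torch.tanh(
            self.w_feat(grid_feats) + self.w_hidden(hidden).unsqueeze(1)
        )).squeeze(-1)                                    # (B, L) region scores
        alpha = torch.softmax(scores, dim=-1)             # attention weights over regions
        context = (alpha.unsqueeze(-1) * grid_feats).sum(dim=1)  # (B, feat_dim)
        return context, alpha


# Example: attend over a 7x7 feature map at one decoding step.
attn = AdditiveAttention(feat_dim=2048, hidden_dim=512)
context, alpha = attn(torch.randn(4, 49, 2048), torch.randn(4, 512))
```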
“…Hidden state reconstruction - Chen et al. [34] proposed to regularize the transition dynamics of the language model by using a second LSTM for reconstructing the previous hidden state based on the current one. Ge et al. [36] proposed to better capture context information by using a bidirectional LSTM with an auxiliary module. The auxiliary module in one direction approximates the hidden state of the LSTM in the other direction.…”
Section: Single-Layer LSTM
Mentioning confidence: 99%
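The hidden-state reconstruction idea quoted above can be illustrated with a small auxiliary module that tries to recover the previous decoder state from the current one and is penalized for the mismatch; the LSTM-cell reconstructor and mean-squared-error loss below are assumptions used only to sketch the general mechanism, not the exact formulation of Chen et al. [34] or Ge et al. [36].

```python
# Hidden-state reconstruction regularizer (illustrative sketch of the general
# idea; module choice and loss are assumptions, not the cited formulations).
import torch
import torch.nn as nn


class HiddenStateReconstructor(nn.Module):
    """Predicts the previous decoder hidden state from the current one."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.recon = nn.LSTMCell(hidden_dim, hidden_dim)

    def forward(self, hidden_states):
        # hidden_states: (B, T, H), decoder hidden states h_1..h_T
        B, T, H = hidden_states.shape
        h = hidden_states.new_zeros(B, H)
        c = hidden_states.new_zeros(B, H)
        loss = hidden_states.new_zeros(())
        for t in range(1, T):
            # Reconstruct h_{t-1} from h_t and penalize the mismatch, which
            # regularizes the transition dynamics of the language model.
            h, c = self.recon(hidden_states[:, t], (h, c))
            loss = loss + torch.mean((h - hidden_states[:, t - 1]) ** 2)
        return loss / max(T - 1, 1)


# Example: add the reconstruction term to the captioning objective,
# e.g. total = cross_entropy + 0.1 * aux_loss (weight is an assumption).
recon = HiddenStateReconstructor(hidden_dim=512)
aux_loss = recon(torch.randn(4, 12, 512))
```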
“…To enhance diversity, GAN-based methods (Dognin et al. 2019; Dai et al. 2017; Chen et al. 2018) are introduced in image captioning. Models proposed in (Zheng, Li, and Wang 2019; Ge et al. 2019) change the order of sentence generation, starting from the middle or the end of sentences. In (Yang et al. 2018; Chen et al. 2020; Shi et al. 2020), scene graphs are employed to further explore the objects, attributes, and relationships in the image, which improves the overall performance of captioning models.…”
Section: Image Captioning
Mentioning confidence: 99%
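The idea of changing the generation order, as in Zheng, Li, and Wang (2019) and Ge et al. (2019), can be sketched with two decoders that grow a sentence outward from a chosen middle word; the greedy decoding and module layout below are purely illustrative assumptions and do not reproduce either cited method.

```python
# Sketch of non-left-to-right caption generation: start from a given middle
# word and grow the sentence in both directions with two decoders.
# (Illustration of the general idea only; the cited methods differ in detail.)
import torch
import torch.nn as nn


class TwoWayDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fwd = nn.LSTMCell(embed_dim, hidden_dim)   # generates words to the right
        self.bwd = nn.LSTMCell(embed_dim, hidden_dim)   # generates words to the left
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def generate(self, start_word, steps=5):
        right, left = [], []
        for cell, side in ((self.fwd, right), (self.bwd, left)):
            h = torch.zeros(1, self.out.in_features)
            c = torch.zeros(1, self.out.in_features)
            word = torch.tensor([start_word])
            for _ in range(steps):
                h, c = cell(self.embed(word), (h, c))
                word = self.out(h).argmax(dim=-1)       # greedy next/previous word id
                side.append(word.item())
        # Left-side tokens were produced inner-to-outer, so reverse them.
        return list(reversed(left)) + [start_word] + right


decoder = TwoWayDecoder(vocab_size=1000)
print(decoder.generate(start_word=42))   # word ids growing outward from word 42
```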
“…Generally, existing WREG methods consist of two steps, namely sentence-level matching and reconstruction [7][8][9][10]. In the first step, WREG methods roughly adopt the sentence-level matching procedures from existing fully-supervised REG methods [5] in order to calculate the similarity between the entire query and each candidate proposal.…”
Section: Introduction
Mentioning confidence: 99%
“…1(a). Its accuracy, however, proves hardly satisfactory even in a fully-supervised setting [9,10] and makes the BP loss unreliable. Additionally, there is an architectural imbalance from the heavy RNN-style reconstruction network, which is never used in the final inference stage while occupying a large proportion of the parameters of the entire network (around 75% in [7,8]).…”
Section: Introduction
Mentioning confidence: 99%