Efficient Modeling of Future Context for Image Captioning

Fei, Zhengcong

doi:10.1145/3503161.3547840

Cited by 6 publications

(2 citation statements)

References 48 publications

(31 reference statements)


“…This gives the resulting sentences more detail in their descriptions of the scene than traditional approaches. More recently, Fei [67] proposed a model that generates descriptions that effectively exploit the global context of the scene without implying an additional cost of inference. The model is trained with two sets: one contains the description labels, and the other includes the description of the general context of the image.…”

Section: Review and Discussion (mentioning)

confidence: 99%

Supervised Deep Learning Techniques for Image Description: A Systematic Review

López-Sánchez, Hernández-Ocaña, Chávez-Bosquez, et al. 2023

Entropy


Automatic image description, also known as image captioning, aims to describe the elements included in an image and their relationships. This task involves two research fields: computer vision and natural language processing; thus, it has received much attention in computer science. In this review paper, we follow the Kitchenham review methodology to present the most relevant approaches to image description methodologies based on deep learning. We focused on works using convolutional neural networks (CNN) to extract the characteristics of images and recurrent neural networks (RNN) for automatic sentence generation. As a result, 53 research articles using the encoder-decoder approach were selected, focusing only on supervised learning. The main contributions of this systematic review are: (i) to describe the most relevant image description papers implementing an encoder-decoder approach from 2014 to 2022 and (ii) to determine the main architectures, datasets, and metrics that have been applied to image description.



“…Recently, the growing interest in multimodal research (Fei 2022; Li et al 2022a; Chen et al 2022; Jing et al 2020; Ma et al 2022, 2023; Ji et al 2022; Huang et al 2023; Zhao et al 2023; Wu et al 2023) at the intersection of computer vision and natural language processing has driven the development of systems that can understand and describe the world as humans do. Panoptic Narrative Grounding (PNG) (González et al 2021) is an emerging visually-grounded language understanding task that aims to locate and segment all instances of objects and regions in an image, corresponding to a given text description using binary pixel masks.…”

Section: Introduction (mentioning)

confidence: 99%

Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation

Guo, Wang, et al. 2024

AAAI


Recent advancements in single-stage Panoptic Narrative Grounding (PNG) have demonstrated significant potential. These methods predict pixel-level masks by directly matching pixels and phrases. However, they often neglect the modeling of semantic and visual relationships between phrase-level instances, limiting their ability for complex multi-modal reasoning in PNG. To tackle this issue, we propose XPNG, a “differentiation-refinement-localization” reasoning paradigm for accurately locating instances or regions. In XPNG, we introduce a Semantic Context Convolution (SCC) module to leverage semantic priors for generating distinctive features. This well-crafted module employs a combination of dynamic channel-wise convolution and pixel-wise convolution to embed semantic information and establish inter-object relationships guided by semantics. Subsequently, we propose a Visual Context Verification (VCV) module to provide visual cues, eliminating potential space biases introduced by semantics and further refining the visual features generated by the previous module. Extensive experiments on PNG benchmark datasets reveal that our approach achieves state-of-the-art performance, significantly outperforming existing methods by a considerable margin and yielding a 3.9-point improvement in overall metrics. Our codes and results are available at our project webpage: https://github.com/TianyuGoGO/XPNG.
