2020
DOI: 10.1609/aaai.v34i05.6503

Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption

Abstract: Personalized image captioning, a natural extension of the standard image captioning task, requires generating brief image descriptions tailored to users' writing styles and traits, and is more practical for meeting users' real demands. Only a few recent studies have shed light on this crucial task, learning static user representations to capture users' long-term literal-preference. However, this is insufficient to achieve satisfactory performance due to the intrinsic existence of not only long-term user literal-preference, but…
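To make the long- and short-term distinction concrete, the sketch below fuses a static per-user embedding (long-term literal-preference) with a pooled transformer encoding of the user's recent captions (short-term literal-preference). It is a minimal PyTorch illustration under assumed module names and dimensions, not the paper's multimodal hierarchical architecture.

```python
# Minimal sketch (PyTorch): fuse a static per-user embedding (long-term
# literal-preference) with an encoding of the user's recent captions
# (short-term literal-preference). All names and dimensions are
# hypothetical, not the paper's implementation.
import torch
import torch.nn as nn

class UserPreferenceEncoder(nn.Module):
    def __init__(self, num_users, vocab_size, d_model=256, nhead=4):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, d_model)  # long-term preference
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.caption_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, user_ids, recent_caption_tokens):
        # user_ids: (B,); recent_caption_tokens: (B, T) token ids of the
        # user's most recent captions.
        long_term = self.user_emb(user_ids)                         # (B, D)
        h = self.caption_enc(self.tok_emb(recent_caption_tokens))   # (B, T, D)
        short_term = h.mean(dim=1)                                  # (B, D)
        return self.fuse(torch.cat([long_term, short_term], dim=-1))
```

The resulting user vector could then condition a caption decoder, as sketched further below.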

Cited by 18 publications (5 citation statements). References 23 publications (36 reference statements).

“…This task requires the model to generate captions to describe the content of images [59]. State-of-the-art methods follow multi-modal attention designs, treating the task as a multi-modal translation problem [66,73,75]. Our focus in this work is not to design a new captioning model, but to explore image captioning as a sub-task for open vocabulary learning to enhance the novel class discovery ability.…”
Section: Related Work
confidence: 99%
“…To this end, early approaches exploit a memory block as a repository for this contextual information [207], [208]. On another line, Zhang et al [209] proposed a multi-modal Transformer network that personalizes captions conditioned on the user's recent captions and a learned user representation. Other works have instead focused on the style of captions as an additional controllable input and proposed to solve this task by exploiting unpaired stylized textual corpus [210], [211], [212], [213] and adversarial learning [212].…”
Section: Addressing User Requirements
confidence: 99%
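As a rough illustration of this kind of conditioning, a transformer decoder can attend over image features with the learned user representation prepended to the cross-attention memory. This is a hypothetical sketch, not the architecture of [209]; all names are illustrative.

```python
# Hypothetical sketch: a caption decoder conditioned on a learned user
# representation by prepending it to the image features the decoder
# cross-attends to. Not the architecture from [209].
import torch
import torch.nn as nn

class PersonalizedCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, caption_tokens, image_feats, user_vec):
        # caption_tokens: (B, T) ids; image_feats: (B, R, D) region features;
        # user_vec: (B, D), e.g. from a user-preference encoder.
        memory = torch.cat([user_vec.unsqueeze(1), image_feats], dim=1)
        T = caption_tokens.size(1)
        # Causal mask so each position only attends to earlier tokens.
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        h = self.dec(self.tok_emb(caption_tokens), memory, tgt_mask=mask)
        return self.out(h)  # (B, T, vocab_size) next-token logits
```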
“…Inspired by the success of the self-attention mechanism adopted in transformer architectures in the NLP field [10,11], some works employed them in the computer vision field [35][36][37].…”
Section: B. Combining Self-Attention Mechanisms With CNNs
confidence: 99%
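For context, the self-attention mechanism these works build on reduces to scaled dot-product attention. The sketch below is a generic illustration with hypothetical projection matrices, not code from any cited paper.

```python
# Generic scaled dot-product self-attention, the core operation of
# transformer architectures; dimensions and weights are illustrative.
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (B, T, D) token features; w_q/w_k/w_v: (D, D) learned projections.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, T, T)
    return torch.softmax(scores, dim=-1) @ v  # weighted sum over positions
```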