2020
DOI: 10.1609/aaai.v34i05.6503

Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption

Abstract: Personalized image captioning, a natural extension of the standard image captioning task, requires generating brief image descriptions tailored to users' writing styles and traits, and is more practical for meeting users' real demands. Only a few recent studies have shed light on this crucial task, learning static user representations to capture users' long-term literal-preference. However, this is insufficient to achieve satisfactory performance due to the intrinsic existence of not only long-term user literal-preference, but…
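To make the long- and short-term distinction concrete, the sketch below fuses a static per-user embedding (long-term literal-preference) with a pooled transformer encoding of the user's recent captions (short-term literal-preference). It is a minimal PyTorch illustration under assumed module names and dimensions, not the paper's multimodal hierarchical architecture.

```python
# Minimal sketch (PyTorch): fuse a static per-user embedding (long-term
# literal-preference) with an encoding of the user's recent captions
# (short-term literal-preference). All names and dimensions are
# hypothetical, not the paper's implementation.
import torch
import torch.nn as nn

class UserPreferenceEncoder(nn.Module):
    def __init__(self, num_users, vocab_size, d_model=256, nhead=4):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, d_model)  # long-term preference
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.caption_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, user_ids, recent_caption_tokens):
        # user_ids: (B,); recent_caption_tokens: (B, T) token ids of the
        # user's most recent captions.
        long_term = self.user_emb(user_ids)                         # (B, D)
        h = self.caption_enc(self.tok_emb(recent_caption_tokens))   # (B, T, D)
        short_term = h.mean(dim=1)                                  # (B, D)
        return self.fuse(torch.cat([long_term, short_term], dim=-1))
```

The resulting user vector could then condition a caption decoder, as sketched further below.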

Cited by 18 publications (5 citation statements). References 23 publications (36 reference statements).

“…This task requires the model to generate captions to describe the content of images [59]. State-of-the-art methods follow multi-modal attention designs, treating the task as a multi-modal translation problem [66,73,75]. Our focus in this work is not to design a new captioning model, but to explore image captioning as a sub-task for open vocabulary learning to enhance the novel class discovery ability.…”
Section: Related Work
confidence: 99%
“…To this end, early approaches exploit a memory block as a repository for this contextual information [207], [208]. On another line, Zhang et al [209] proposed a multi-modal Transformer network that personalizes captions conditioned on the user's recent captions and a learned user representation. Other works have instead focused on the style of captions as an additional controllable input and proposed to solve this task by exploiting unpaired stylized textual corpus [210], [211], [212], [213] and adversarial learning [212].…”
Section: Addressing User Requirements
confidence: 99%
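As a rough illustration of this kind of conditioning, a transformer decoder can attend over image features with the learned user representation prepended to the cross-attention memory. This is a hypothetical sketch, not the architecture of [209]; all names are illustrative.

```python
# Hypothetical sketch: a caption decoder conditioned on a learned user
# representation by prepending it to the image features the decoder
# cross-attends to. Not the architecture from [209].
import torch
import torch.nn as nn

class PersonalizedCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, caption_tokens, image_feats, user_vec):
        # caption_tokens: (B, T) ids; image_feats: (B, R, D) region features;
        # user_vec: (B, D), e.g. from a user-preference encoder.
        memory = torch.cat([user_vec.unsqueeze(1), image_feats], dim=1)
        T = caption_tokens.size(1)
        # Causal mask so each position only attends to earlier tokens.
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        h = self.dec(self.tok_emb(caption_tokens), memory, tgt_mask=mask)
        return self.out(h)  # (B, T, vocab_size) next-token logits
```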
“…Inspired by the success of the self-attention mechanism adopted in transformer architectures in the NLP field [10,11], some works employed them in the computer vision field [35][36][37].…”
Section: B. Combining Self-Attention Mechanisms With CNNs
confidence: 99%
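For context, the self-attention mechanism these works build on reduces to scaled dot-product attention. The sketch below is a generic illustration with hypothetical projection matrices, not code from any cited paper.

```python
# Generic scaled dot-product self-attention, the core operation of
# transformer architectures; dimensions and weights are illustrative.
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (B, T, D) token features; w_q/w_k/w_v: (D, D) learned projections.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, T, T)
    return torch.softmax(scores, dim=-1) @ v  # weighted sum over positions
```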