2019
DOI: 10.1016/j.patrec.2017.10.018

Image Caption Generation with Part of Speech Guidance

Cited by 62 publications (27 citation statements)
References 3 publications
“…Image generation conditioned on natural language [26], also known as text-to-image generation brings us vivid visual representation from text. Several works present different approaches for synthesizing detailed textual descriptions from images or videos [14,19], which are also called image/video caption. Besides, sound and image can be converted to each other in [9].…”
Section: Multi-modal Generation (mentioning)
confidence: 99%
“…Furthermore, identifying objects in order to determine the appropriate caption still achieves low accuracy [14]. The problem of extracting image features that are related to word embeddings needs to be studied to avoid overfitting on the image content [15].…”
Section: Introduction (unclassified)
“…It is in this image captioning method that the approach is combined with natural language processing, which aims to produce a caption for the image being read. As in the studies [14], [15], [17], [31],…”
Section: Discussion (unclassified)
“…In the text generation tasks, the word POS tag is predicted by the previously generated words and state of the decoder recursively. These works can be divided into two categories: (1) treated as a multi-task learning problem: for example, the authors of [26] treated POS tagging as an auxiliary task, i.e., predicting the POS tag for each word to be generated alongside the word generation, and the authors of [27] predicted the POS tag and named entity (NE) tag at the same time as word generation; (2) gate for external features: for example, the authors of [31,32] predicted the POS information for the word as a condition to determine whether the visual (external) feature is essential for current word generation. However, all of the above approaches did not utilize the POS priors to guide the heterogeneous visual feature assembly based on the intrinsic relationship between word class and feature categories.…”
Section: Part-of-speech Prediction (mentioning)
confidence: 99%
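
The two categories described in the statement above can be illustrated with a minimal sketch. It is not taken from the cited papers: the function names, the pos_weight parameter, and the choice of which POS tags count as "visual" are all assumptions made for illustration. The first function combines a word-prediction loss with an auxiliary POS-tagging loss (the multi-task view); the second scales a visual feature vector by the predicted probability that the next word belongs to a visually grounded word class (the gating view).

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-likelihood of the target index under softmax(logits).
    return -math.log(softmax(logits)[target])

def multitask_loss(word_logits, word_target, pos_logits, pos_target, pos_weight=0.5):
    # Category (1): POS tagging as an auxiliary task.
    # pos_weight is a hypothetical hyperparameter balancing the two losses.
    word_loss = cross_entropy(word_logits, word_target)
    pos_loss = cross_entropy(pos_logits, pos_target)
    return word_loss + pos_weight * pos_loss

def gated_visual_feature(visual, pos_logits, visual_pos_ids):
    # Category (2): the predicted POS distribution acts as a gate.
    # visual_pos_ids is an assumed set of tag indices (e.g. nouns, adjectives)
    # for which the visual feature is considered essential.
    pos_probs = softmax(pos_logits)
    gate = sum(pos_probs[i] for i in visual_pos_ids)
    return [gate * v for v in visual]
```

In this sketch the gate is a soft value between 0 and 1, so the visual feature is attenuated rather than switched off when the decoder expects a function word; a hard threshold would be an equally plausible design under the same assumptions.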