Image Caption Generation with Part of Speech Guidance

He, Xinwei; Shi, Baoguang; Xia, Gui-Song; Zhang, Zhaoxiang; Dong, Weisheng

doi:10.1016/j.patrec.2017.10.018

Cited by 62 publications

(27 citation statements)

References 3 publications


“…Image generation conditioned on natural language [26], also known as text-to-image generation brings us vivid visual representation from text. Several works present different approaches for synthesizing detailed textual descriptions from images or videos [14,19], which are also called image/video caption. Besides, sound and image can be converted to each other in [9].…”

Section: Multi-modal Generation (mentioning)

confidence: 99%

Talking Face Generation with Expression-Tailored Generative Adversarial Network

Zeng, Han, Lin, et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


A key to automatically generating vivid talking faces is synthesizing identity-preserving natural facial expressions beyond audio-lip synchronization, which usually requires disentangling the informative features from multiple modalities and then fusing them together. In this paper, we propose an end-to-end Expression-Tailored Generative Adversarial Network (ET-GAN) to generate an expression-enriched talking face video of arbitrary identity. Unlike talking face generation based on an identity image and audio, an expressional video of arbitrary identity serves as the expression source in our approach. An expression encoder is proposed to disentangle the expression-tailored representation from the guiding expressional video, while an audio encoder disentangles the audio-lip representation. Instead of using a single image as identity input, a multi-image identity encoder is proposed that learns different views of faces and merges them into a unified representation. Multiple discriminators are exploited to keep both image-aware and video-aware realistic details, including a spatial-temporal discriminator for visual continuity of expression synthesis and facial movements. We conduct extensive experimental evaluations on quantitative metrics, expression retention quality, and audiovisual synchronization. The results show the effectiveness of our ET-GAN in generating high-quality expressional talking face videos compared with existing state-of-the-art methods.


“…Furthermore, identifying objects in order to determine the appropriate caption still yields low accuracy [14]. The problem of extracting image features that are related to word embeddings needs to be studied in order to avoid overfitting to the image content [15].…”

Section: Introduction (unclassified)

“…It is in image captioning methods that techniques are combined with natural language processing in an effort to produce a caption for the image being read, as in the studies [14], [15], [17], [31],…”

Section: Discussion (unclassified)

Image Captioning menurut Scientific Revolution Kuhn dan Popper

Nursikuwagus, Munir, Khodra 2020

JAMIKA


Generating captions for images is a new area of development in artificial intelligence. Image captioning combines several fields, such as computer vision, natural language, and machine learning. The main concern in image captioning is how well the modeled neural network architecture produces results that are as close as possible to the ground truth provided by a person. Several existing studies still produce sentences that are far from that ground truth. The problems generally discussed in image captioning concern the image generator and the text generator, namely the use of deep learning such as CNN and LSTM to solve the captioning problem. This forms the basis for making new contributions to image captioning, covering the image extractor, text generator, and evaluator that can be used in the proposed model. From the perspective of Kuhn and Popper, captioning in the field of geology is found to be much needed and to have reached the crisis stage. A new proposed method is needed to produce captions for geological images.


“…In the text generation tasks, the word POS tag is predicted by the previously generated words and state of the decoder recursively. These works can be divided into two categories: (1) treated as a multi-task learning problem: for example, the authors of [26] treated POS tagging as an auxiliary task, i.e., predicting the POS tag for each word to be generated alongside the word generation, and the authors of [27] predicted the POS tag and named entity (NE) tag at the same time as word generation; (2) gate for external features: for example, the authors of [31,32] predicted the POS information for the word as a condition to determine whether the visual (external) feature is essential for current word generation. However, none of the above approaches utilized the POS priors to guide the heterogeneous visual feature assembly based on the intrinsic relationship between word class and feature categories.…”

Section: Part-of-speech Prediction (mentioning)

confidence: 99%
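
The two categories named in the statement above can be illustrated with a minimal sketch. It is not taken from any of the cited papers: the function names, the `pos_weight` value, and the choice of which POS tags count as "visual" are all assumptions made for illustration. Category (1) is shown as a joint loss that adds a weighted POS-tagging term to the word loss, and category (2) as a scalar gate that scales the visual feature by the predicted probability of a visually grounded POS tag.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -math.log(softmax(logits)[target_index])


def multitask_loss(word_logits, word_target, pos_logits, pos_target,
                   pos_weight=0.5):
    """Category (1): POS tagging as an auxiliary task.

    pos_weight is a hypothetical hyperparameter balancing the two terms.
    """
    word_loss = cross_entropy(word_logits, word_target)
    pos_loss = cross_entropy(pos_logits, pos_target)
    return word_loss + pos_weight * pos_loss


def pos_gated_visual(visual_feature, pos_logits, visual_pos_ids):
    """Category (2): use the predicted POS distribution as a gate.

    visual_pos_ids is an assumed set of tag indices (e.g. nouns,
    adjectives) for which the visual feature is considered essential.
    """
    probs = softmax(pos_logits)
    gate = sum(probs[i] for i in visual_pos_ids)
    return [gate * v for v in visual_feature], gate
```

With uniform logits over four POS tags and two of them marked as visual, the gate evaluates to 0.5, so the visual feature is halved before it reaches the word predictor.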

Learn and Tell: Learning Priors for Image Caption Generation

2020


In this work, we propose a novel priors-based attention neural network (PANN) for image captioning, which aims at incorporating two kinds of priors, i.e., the probabilities being mentioned for local region proposals (PBM priors) and part-of-speech clues for caption words (POS priors), into a visual information extraction process at each word prediction. This work was inspired by the intuitions that region proposals have different inherent probabilities for image captioning, and that the POS clues bridge the word class (part-of-speech tag) with the categories of visual features. We propose new methods to extract these two priors, in which the PBM priors are obtained by computing the similarities between the caption feature vector and local feature vectors, while the POS priors are predicted at each step of word generation by taking the hidden state of the decoder as input. After that, these two kinds of priors are further incorporated into the PANN module of the decoder to help the decoder extract more accurate visual information for the current word generation. In our experiments, we qualitatively analyzed the proposed approach and quantitatively evaluated several captioning schemes with our PANN on the MS-COCO dataset. Experimental results demonstrate that our proposed method achieves better performance and confirm the effectiveness of the proposed network for image captioning.
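
As a rough illustration of how the two priors in this abstract might be computed, the sketch below scores each region by its similarity to a caption vector and predicts a POS distribution from a decoder hidden state. The abstract only says "similarities" and "taking the hidden state of the decoder as input"; the use of cosine similarity, softmax normalization, a single linear layer, and all function and parameter names here are assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pbm_priors(caption_vec, region_vecs):
    """Assumed PBM prior: softmax over caption-to-region cosine similarities."""
    sims = [cosine(caption_vec, r) for r in region_vecs]
    return softmax(sims)


def pos_priors(hidden_state, weight_rows, bias):
    """Assumed POS prior: one linear layer on the hidden state, then softmax.

    weight_rows holds one weight vector per POS tag (hypothetical parameters).
    """
    logits = [sum(w * h for w, h in zip(row, hidden_state)) + b
              for row, b in zip(weight_rows, bias)]
    return softmax(logits)
```

Under these assumptions, a region whose feature points in the same direction as the caption vector receives the largest prior, and both outputs are proper probability distributions that a decoder could use to reweight its attention.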
