Integrating Part of Speech Guidance for Image Captioning

Ji, Zhang; Mei, Kuizhi; Zheng, Yu; Fan, Jianping

doi:10.1109/tmm.2020.2976552

Cited by 40 publications

(7 citation statements)

References 42 publications

“…Image captioning provides a variety of approaches that link the visual contents with normal language, e.g., explaining images with textual descriptions [11,12]. In the existing literature, artificial neural network-based models were utilized to encode visual information with pre trained classification networks such as CNN and RNN [13].…”

Section: Introduction (mentioning)

confidence: 99%

Medical Image Captioning Using Optimized Deep Learning Model

Singh, Raguru, Prasad, et al. 2022

Computational Intelligence and Neuroscience

Medical image captioning provides the visual information of medical images in the form of natural language. It requires an efficient approach to understand and evaluate the similarity between visual and textual elements and to generate a sequence of output words. A novel show, attend, and tell model (ATM) is implemented, which considers a visual attention approach using an encoder-decoder model. But the show, attend, and tell model is sensitive to its initial parameters. Therefore, a Strength Pareto Evolutionary Algorithm-II (SPEA-II) is utilized to optimize the initial parameters of the ATM. Finally, experiments are considered using the benchmark data sets and competitive medical image captioning techniques. Performance analysis shows that the SPEA-II-based ATM performs significantly better as compared to the existing models.

“…While this method was the first attempt toward integrating the POS to emphasize the images, the words with multiple POSs were not considered because a pre-made POS dictionary was used. Zhang et al [11] proposed the POS Guidance module, a method to use POS as a guide in image captioning. They proposed two models using POS as a guide for the inject-based method and the merge-based method, which are the most used among the image captioning methods analyzed by Tanti et al [45].…”

Section: Theoretical Background (mentioning)

confidence: 99%

“…Most studies in the image caption area apply the encoder-decoder framework, which consists of an encoder that extracts features from an image and a decoder that generates sentences due to the development of deep learning. Unlike conventional methods [6]- [9], this structure can create various captions from scenes without using a fixed sentence template [11]- [14]. This description has a more unconstrained structure than before, detailing the context.…”

Section: Introduction (mentioning)

confidence: 99%

Image Captioning Model Using Part-of-Speech Guidance Module for Description With Diverse Vocabulary

et al. 2022

Image captions aim to generate human-like sentences that describe the image's content. Recent developments in deep learning (DL) have made it possible to caption images for accurate descriptions and detailed expressions. However, since DL learns the relationship between images and captions, it constructs sentences based on commonly frequented words in the dataset. Although these generated sentences are highly accurate, they have low lexical diversity, unlike humans due to limited vocabulary. Therefore, in this paper, we propose a Part-Of-Speech (POS) guidance module and a multimodal-based image captioning model that determines the intensity of images and word sequences and generates sentences through POS to enhance the lexical diversity of DL. The proposed POS guidance module enables rich expression by controlling the information of images and sequences based on the predicted POS guidance to predict words. Then, the POS multimodal layer adds POS and output vector of Bi-LSTM using the multimodal layer to predict the next caption, considering the grammatical structure. We trained and tested the proposed model on the Flickr 30K and MS COCO datasets and compared them with current state-of-the-art studies. Also, we analyzed the lexical diversity of the caption model through the Type-Token Ratio (TTR) and confirmed that the proposed model generates sentences using several words.
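
For reference, the Type-Token Ratio named in the abstract is commonly defined as the number of distinct words (types) divided by the total number of words (tokens). The Python sketch below illustrates that general definition only. The function name, the tokenization rule, and the example captions are assumptions made for illustration; they are not taken from the paper, which may compute TTR differently (for example over a whole corpus of captions).

```python
import re


def type_token_ratio(caption: str) -> float:
    """Return distinct-token count divided by total-token count.

    Returns 0.0 for text with no tokens.
    """
    # Assumption: lowercase the text and treat runs of letters or
    # apostrophes as tokens. Real evaluations may tokenize differently.
    tokens = re.findall(r"[a-z']+", caption.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical captions, invented for this example.
repetitive = "a man is riding a bike on a road"
varied = "an elderly cyclist pedals along the quiet coastal road"

print(f"repetitive: {type_token_ratio(repetitive):.3f}")
print(f"varied:     {type_token_ratio(varied):.3f}")
```

A higher ratio means fewer repeated words. Note that TTR is sensitive to text length, so comparisons are most meaningful between texts of similar size.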

“…Image captioning [1], [2], [3] aims at describing the content and event of an image using a couple of words. We can Fig.…”

Section: Introduction (mentioning)

confidence: 99%