2022
DOI: 10.1007/s11063-022-10759-z

A New Attention-Based LSTM for Image Captioning

Cited by 21 publications (6 citation statements)
References 27 publications
“…Initially, we train our captioning model by minimizing the cross-entropy loss of the output caption, where the output token sequence length is restricted to 75 tokens: $L^{XE}_{Cap}(\theta) = -\sum_{t=1}^{T} \log p_\delta(\bar{y}_t \mid \bar{y}_{1:t-1})$, where $T$ denotes the number of words in a sentence and $\delta$ denotes the model parameters. In the second step, the reinforcement approach [36] is used to maximize the CIDEr score, where we take the CIDEr score as the reward. We used the Adamax optimizer [37] and a learning rate of $5 \times 10^{-4}$ to train the model, minimizing the negative expected reward of randomly sampled captions as the loss $L^{RL}_{Cap}(\theta) = -\mathbb{E}_{y^{s}_{1:T} \sim p_\delta}\big[\gamma(y^{s}_{1:T};\, y^{*}_{1:T})\big]$, where the reward …”
Section: Methods (mentioning; confidence: 99%)
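The cross-entropy objective quoted above sums the negative log-probability of each ground-truth token under the decoder's predictive distribution. A minimal sketch of that computation, assuming raw per-step decoder scores (all names here are illustrative, not the cited paper's API):

```python
import math

def caption_xe_loss(logits, targets):
    """Cross-entropy caption loss: -sum_t log p_delta(y_t | y_1..t-1).

    logits:  list of per-step raw score vectors, one per time step (length T).
    targets: list of T ground-truth token ids.
    """
    loss = 0.0
    for scores, y_t in zip(logits, targets):
        z = sum(math.exp(s) for s in scores)   # softmax normaliser over the vocab
        log_p = scores[y_t] - math.log(z)      # log p_delta(y_t | y_<t)
        loss -= log_p
    return loss
```

With uniform scores over a vocabulary of size $V$, the loss reduces to $T \log V$, a useful sanity check before training.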
“…where $T$ denotes the number of words in a sentence and $\delta$ denotes the model parameters. In the second step, the reinforcement approach [36] is used to maximize the CIDEr score, where we take the CIDEr score as the reward. We used the Adamax optimizer [37] and a learning rate of $5 \times 10^{-4}$ to train the model, minimizing the negative expected reward of randomly sampled captions as the loss…”
Section: Evaluation Methods: Cross-Entropy Loss and CIDEr-D (mentioning; confidence: 99%)
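The reinforcement step above minimizes the negative expected reward, which in practice is approximated with a REINFORCE-style surrogate loss on sampled captions. A minimal sketch, assuming the CIDEr score of the sampled caption is already computed externally (the function name and baseline argument are illustrative assumptions, not the cited paper's exact formulation):

```python
def rl_caption_loss(log_probs, sampled_reward, baseline_reward=0.0):
    """REINFORCE surrogate for L_Cap^RL(theta) = -E[gamma(y^s; y*)].

    log_probs:       per-token log p_delta(y_t^s | y_<t) of the sampled caption.
    sampled_reward:  CIDEr score gamma(y^s; y*) of the sampled caption.
    baseline_reward: optional baseline (e.g. the greedy caption's CIDEr)
                     subtracted to reduce gradient variance.
    """
    advantage = sampled_reward - baseline_reward
    # Scaling the sampled caption's log-likelihood by the negative advantage
    # pushes probability mass toward higher-reward captions when minimized.
    return -advantage * sum(log_probs)
```

Subtracting a greedy-decoding baseline, as in self-critical sequence training, is a common variance-reduction choice for this kind of objective.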
“…The encoder-decoder (ED) image captioning model was inspired by the development of neural network-based machine translation systems. This model uses a DL-based framework in which the decoder generates captions from the features the encoder extracts from the input image [14].…”
Section: Related Work (mentioning; confidence: 99%)
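The encoder-decoder flow described in that excerpt (encode the image once, then decode the caption token by token) can be sketched as follows; every name here is a hypothetical placeholder, not the cited model's API:

```python
def generate_caption(image, encoder, decoder, bos_id, eos_id, max_len=75):
    """Greedy encoder-decoder captioning loop.

    encoder(image)            -> image feature representation.
    decoder(features, prefix) -> score per vocabulary token for the next step.
    """
    features = encoder(image)            # encode the image once
    caption = [bos_id]
    for _ in range(max_len):
        scores = decoder(features, caption)
        next_tok = max(range(len(scores)), key=scores.__getitem__)  # greedy pick
        caption.append(next_tok)
        if next_tok == eos_id:           # stop at end-of-sequence token
            break
    return caption
```

Attention-based variants, such as the surveyed paper's LSTM model, replace the single static feature vector with per-step attention over spatial features, but the outer decode loop has the same shape.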
“…• Additional uses: Deep learning is now applied in nearly every industry. Further deep learning applications include automated text generation [22], game playing [23], and image captioning [24].…”
Section: Deep Learning (mentioning; confidence: 99%)