Dual-Modal Transformer with Enhanced Inter- and Intra-Modality Interactions for Image Captioning

Kumar, Deepika; Srivastava, V.; Popescu, Daniela; Hemanth, D. Jude

doi:10.3390/app12136733

Cited by 7 publications

(2 citation statements)

References 45 publications

(46 reference statements)


“…The transformer is able to avoid any duplication by employing the attention in a comprehensive way between the input and the output. Extensive approaches were proposed to employ the transformer models in image captioning [74][75][76][77][78][79][80][81][82][83][84][85].…”

Section: Transformer-based (mentioning)

confidence: 99%

A Survey on Attention-Based Models for Image Captioning

Osman, Shalaby, Soliman et al. 2023

IJACSA

Image captioning is widely used in many real-world applications. The task first applies computer vision methods to understand the image, then applies natural language processing methods to produce a description of it. Different approaches have been proposed to solve this task, and deep learning attention-based models have proven to be the state of the art. This paper presents a survey on attention-based models for image captioning, including new categories that were not included in other survey papers. The attention-based approaches are classified into four main categories, which are further classified into subcategories. All categories and subcategories of the attention-based approaches are discussed in detail. Furthermore, the state-of-the-art approaches are compared and the accuracy improvements are stated, especially for the transformer-based models, and a summary of the benchmark datasets and the main performance metrics is presented.


“…Kumar et al [61] extracted the feature vectors and detected objects in the image. Then, using the feature, vector embedding and object embedding are created, respectively.…”

Section: Grid Features (mentioning)

confidence: 99%

A Systematic Literature Review on Using the Encoder-Decoder Models for Image Captioning in English and Arabic Languages

Alsayed, Arif, Qadah et al. 2023

Applied Sciences

With the explosion of visual content on the Internet, creating captions for images has become a necessary task and an exciting topic for many researchers. Furthermore, image captioning is becoming increasingly important as the number of people utilizing social media platforms grows. While there is extensive research on English image captioning (EIC), studies focusing on image captioning in other languages, especially Arabic, are limited. There has also yet to be an attempt to survey Arabic image captioning (AIC) systematically. This research aims to systematically survey encoder-decoder EIC while considering the following aspects: visual model, language model, loss functions, datasets, evaluation metrics, model comparison, and adaptability to the Arabic language. A systematic review of the literature on EIC and AIC approaches published in the past nine years (2015–2023) from well-known databases (Google Scholar, ScienceDirect, IEEE Xplore) is undertaken. We have identified 52 primary English and Arabic studies relevant to our objectives (11 articles are on Arabic captioning, and the rest are on English). The literature review shows that applying the English-specific models to the Arabic language is possible, given a high-quality Arabic database and appropriate preprocessing. Moreover, we discuss some limitations and ideas for solving them as future directions.

A real-time image captioning framework using computer vision to help the visually impaired

Safiya, Pandian 2023

Multimed Tools Appl
