A Thorough Review on Recent Deep Learning Methodologies for Image Captioning

Elhagry, Ahmed; Kadaoui, Karima

doi:10.48550/arxiv.2107.13114

Cited by 5 publications

(7 citation statements)

References 22 publications

“…In particular, Hossain et al [4] demonstrate how well the model can suddenly adjust its focus to the essential item when creating the corresponding words. They develop two mechanisms-a "hard" deterministic attention mechanism and a "soft" deterministic attention mechanism-and train them using conventional back-propagation techniques while maximizing an approximation of the variational lower bound or something analogous [5]. This approach also has the benefit of roughly depicting what it "sees" to glean insights.…”

Section: Methods (mentioning)

confidence: 99%

“…This, however, is insufficient since the brain quickly converts a vast amount of visual data into descriptive language. A model is proposed concerning a propose a predictive model utilizing a deep reoccurring architecture after being impressed by the most recent advancement in translation software that recursive human brains (RNN) can complete the translation that typically requires a series of subtasks, and even in a more accurate and much simpler way [5]. Instead of using the decoder RNN, which is initially learned for a classification job, the deep neural network (CNN) is employed.…”

Section: Methods (mentioning)

confidence: 99%

“…An end-to-end system, this neural network is wholly trainable and can be improved via stochastic gradient descent. The goal of this model is to maximize conditionally likelihood p(S|I), where S is the output phrase, and I is the input image [5]. They choose to employ a Long-Short Term Recollection Sentence Generator as their RNN alternative, frequently used for translation and creation activities.…”

Section: Methods (mentioning)

confidence: 99%

To describe the content of image: The view from image captioning

Hou

2023

ACE

The aim of developing the technology of "image captioning," which integrates natural language and computer processing, is to automatically give descriptions for photographs by the machine itself. The work can be separated into two parts, which depends on correctly comprehending both language and images from a semantic and syntactic perspective. In light of the growing body of information on the subject, it is getting harder to stay abreast of the most recent advancements in the area of image captioning. Nevertheless, the review papers that are now available don't go into enough detail about those findings. The approaches, benchmarks, datasets, and assessment metrics currently in use for picture captioning are reviewed in this work. The majority of the field's ongoing study is concentrated on robust learning-based techniques, where deep reinforcement, adversarial learning, and attention processes all seem to be at the heart of this research area. Image captioning entails a brand-new field in research on computer vision. Generating a comprehensive natural language description for the source images is the fundamental issue of image captioning. This essay explores and evaluates earlier work on image captioning. Image captioning's application and task situations are introduced. The merits and disadvantages of each approach are explored after the analysis of the image captioning algorithms based on encoder-decoder and template structure. The assessment and baseline dataset for picture captioning are therefore shown. Ultimately, prospects for image captioning's progress are presented.

“…With the enormous surge of social media usage and the continued rapid increase in the visual data generated by users from all around the world, the task of image caption generation is gaining more and more attention and attracting a considerable amount of research efforts from Natural Language Processing (NLP) community and computer vision community [1]. Image captioning; known to be the generation of meaningful captions for images, is a challenging task for machine learning-based models, as they are required to combine both visual and linguistic understanding that includes steps of reasoning to generate high quality captions given an image [2]. However, despite the challenging nature of the image captioning task, recent research endeavors were capable of achieving considerable advances and achievements on this task.…”

Section: Introduction (mentioning)

confidence: 99%

Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm

Za’ter

Talafha

2022

Preprint

The continuous increase in the use of social media and the visual content on the internet have accelerated the research in computer vision field in general and the image captioning task in specific. The process of generating a caption that best describes an image is a useful task for various applications such as it can be used in image indexing and as a hearing aid for the visually impaired. In recent years, the image captioning task has witnessed remarkable advances regarding both datasets and architectures, and as a result, the captioning quality has reached an astounding performance. However, the majority of these advances especially in datasets are targeted for English, which left other languages such as Arabic lagging behind. Although Arabic language, being spoken by more than 450 million people and being the most growing language on the internet, lacks the fundamental pillars it needs to advance its image captioning research, such as benchmarks or unified datasets. This work is an attempt to expedite the synergy in this task by providing unified datasets and benchmarks, while also exploring methods and techniques that could enhance the performance of Arabic image captioning. The use of multi-task learning is explored, alongside exploring various word representations and different features. The results showed that the use of multi-task learning and pre-trained word embeddings noticeably enhanced the quality of image captioning, however the presented results show that Arabic captioning still lags behind when compared to the English language. The used dataset and code are available at this link.

Keywords: Image Captioning • Arabic Language • Multi-task Learning • Benchmark

However, the overwhelming majority of the available datasets, algorithms, and research efforts were directed toward English, which leaves other languages lagging in this task. Arabic language for instance, is one of the majorly used languages on the web, used by more than 450 million people, and also among internet users, Arabic is the language with the fastest growth rate in the last few years [8]. Despite these facts, the image captions generation was untouched

“…It is noticed that they prefer to focus on specific aspects of this emerging vision to language tasks, such as the technical framework, evaluation indicators, training strategies, or publicly available datasets. However, the existing studies on the review of image captioning have been considered slightly out of vogue or fail to provide a comprehensive overview of the current research, including technologies, benchmark datasets, and evaluation metrics [3,4,[120][121][122]. There is still a lack of literature that comprehensively reviews the research status, innovative technologies, and development prospects.…”

Section: Introduction (mentioning)

confidence: 99%

A thorough review of models, evaluation metrics, and datasets on image captioning

Luo

Cheng

Chao

et al.

2021

IET Image Processing

Image captioning means generate descriptive sentences from a query image automatically. It has recently received widespread attention from the computer vision and natural language processing communities as an emerging visual task. Currently, both components have evolved considerably by exploiting object regions, attributes, attention mechanism methods, entity recognition with novelties, and training strategies. However, despite the impressive results, the research has not yet come to a conclusive answer. This survey aims to provide a comprehensive overview of image captioning methods, from technical architectures to benchmark datasets, evaluation metrics, and comparison of state-of-the-art methods. In particular, image captioning methods are divided into different categories based on the technique adopted. Representative methods in each class are summarized, and their advantages and limitations are discussed. Moreover, many related state-of-the-art studies were quantitatively compared to determine the recent trends and future directions in image captioning. The ultimate goal of this work is to serve as a tool for understanding the existing literature and highlighting future directions in the area of image captioning for Computer Vision and Natural Language Processing communities may benefit from.
