Automated image captioning, the task of generating descriptive text for images, relies on a close fusion of Natural Language Processing (NLP) and computer vision techniques. This study introduces the Fully Convolutional Localization Network (FCLN), a novel approach that addresses localization and description jointly within a single forward pass. By preserving spatial information and avoiding loss of detail, the model can be trained end to end under a single, consistent optimization objective. The foundation of FCLN is a Convolutional Neural Network (CNN) that extracts salient image features. Central to the architecture is a Localization Layer, which is pivotal for precise object detection and caption generation. FCLN combines a region detection network, reminiscent of Faster R-CNN, with a captioning network, enabling the production of contextually meaningful image captions. The Faster R-CNN framework provides region-based object detection, offering precise contextual understanding of objects and their relationships, while a Long Short-Term Memory (LSTM) network generates the captions. This integration yields superior caption accuracy, particularly in complex scenes. Evaluations on the Microsoft Common Objects in Context (MS COCO) test server show that the model surpasses existing benchmarks, underscoring its efficacy in generating precise, context-rich image captions.
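To make the described pipeline concrete, the following is a minimal sketch of an FCLN-style model: a CNN backbone produces a feature map, a localization step pools features for candidate regions, and an LSTM decodes a caption per region. All module names, dimensions, and the externally supplied region boxes (used here in place of a learned proposal mechanism) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from torchvision.ops import roi_align


class CaptioningFCLNSketch(nn.Module):
    """Illustrative FCLN-style pipeline: CNN features -> region pooling -> LSTM captions."""

    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep layers up to the last conv block: a 512-channel feature map.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.region_fc = nn.Linear(512 * 7 * 7, embed_dim)   # region descriptor
        self.embed = nn.Embedding(vocab_size, embed_dim)      # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)    # next-word logits

    def forward(self, images, boxes, captions):
        # images: (B, 3, H, W); boxes: list of (N_i, 4) region boxes in image
        # coordinates; captions: (total_regions, T) token ids.
        feats = self.backbone(images)                          # (B, 512, H/32, W/32)
        # The localization layer is approximated by RoI Align over the given
        # boxes (a stand-in for learned region proposals).
        region_feats = roi_align(feats, boxes, output_size=(7, 7),
                                 spatial_scale=1.0 / 32)
        region_code = self.region_fc(region_feats.flatten(1))  # (R, embed_dim)
        # Condition the LSTM on the region descriptor as its first input step.
        words = self.embed(captions)                           # (R, T, embed_dim)
        inputs = torch.cat([region_code.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.word_head(hidden)                          # (R, T+1, vocab)


# Toy usage: two region boxes on one image, a 5-token caption per region.
model = CaptioningFCLNSketch()
images = torch.randn(1, 3, 224, 224)
boxes = [torch.tensor([[10., 10., 100., 100.], [50., 40., 200., 180.]])]
captions = torch.randint(0, 1000, (2, 5))
logits = model(images, boxes, captions)
print(logits.shape)  # torch.Size([2, 6, 1000])
```

In this sketch the caption decoder operates per region, mirroring the idea that localization and description share one forward pass over a common feature map; the actual FCLN additionally learns its region proposals within the Localization Layer rather than taking boxes as input.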