2022
DOI: 10.1109/access.2022.3161428

Deep Learning Approaches Based on Transformer Architectures for Image Captioning Tasks

Abstract: This paper focuses on visual attention, a state-of-the-art approach for image captioning tasks within the computer vision research area. We study the impact that different hyperparameter configurations have on an encoder-decoder visual attention architecture in terms of efficiency. Results show that the correct selection of both the cost function and the gradient-based optimizer can significantly impact the captioning results. Our system considers the cross-entropy, Kullback-Leibler divergence, mean squared error, …
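The abstract compares cross-entropy, Kullback-Leibler divergence, and mean squared error as candidate cost functions. As a minimal illustration (not the paper's implementation — the function names, toy logits, and four-token vocabulary are invented here), the sketch below evaluates all three losses on a softmax distribution against a one-hot target token:

```python
import math

def softmax(logits):
    """Convert raw scores to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(pred, target):
    """CE = -sum(t * log(p)); for a one-hot target this is -log(p_true)."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

def kl_divergence(pred, target):
    """KL(target || pred) = sum(t * log(t / p)) over the target's support."""
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)

def mean_squared_error(pred, target):
    """MSE between the predicted and target probability vectors."""
    return sum((t - p) ** 2 for t, p in zip(target, pred)) / len(pred)

# Toy vocabulary of 4 caption tokens; the ground-truth token is index 1.
logits = [1.0, 3.0, 0.5, 0.2]
pred = softmax(logits)
one_hot = [0.0, 1.0, 0.0, 0.0]

# For a one-hot target, CE and KL coincide: the target entropy is zero.
print(cross_entropy(pred, one_hot), kl_divergence(pred, one_hot),
      mean_squared_error(pred, one_hot))
```

Because a hard token label has zero entropy, cross-entropy and KL divergence produce identical gradients here; any difference between the two losses only appears when training against soft (smoothed) target distributions.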


Cited by 27 publications (12 citation statements)
References 25 publications
“…CNN compared the target image against a huge dataset of training images, after producing a precise explanation with the help of trained captions. The research scholars in the study conducted earlier [16] aimed at visual attention for which they proposed an advanced technique for image captioning in computer vision research zone. The researchers understood the influence exerted by distinct hyper-parameters over encoder-decoder visual attention structure with regards to efficiency.…”
Section: Literature Review
confidence: 99%
“…First, the use of depthwise convolution. We only introduce an additional 2sC parameters and O(2sCT) FLOPs as compared to the linear projection, which is negligible as compared to the total number of parameters and FLOPs in the models. Second, the process of matrix sharing S. With this improvement, the number of parameters of key and value are reduced by half.…”
Section: B. Convolutional Parameter-Sharing Multi-Head Attention (CPSA)
confidence: 99%
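The quoted CPSA passage claims that sharing a single projection between keys and values halves their parameter count. A back-of-envelope sketch of that claim (the helper name and the no-bias, square-projection assumptions are ours, not from the cited paper):

```python
def mha_params(d_model, separate_kv=True):
    """Count projection parameters (no biases) in multi-head attention.

    Standard MHA uses four d_model x d_model projections: Q, K, V, output.
    With key/value sharing, as in the quoted CPSA scheme, K and V use one
    shared projection, so their combined parameter count drops by half.
    """
    q = d_model * d_model
    out = d_model * d_model
    kv = 2 * d_model * d_model if separate_kv else d_model * d_model
    return q + kv + out

d = 512
standard = mha_params(d, separate_kv=True)   # 4 * d^2 projection parameters
shared = mha_params(d, separate_kv=False)    # 3 * d^2 projection parameters
print(standard, shared)  # K/V parameters halved: 2*d^2 -> d^2
```

The saving is exactly d_model² parameters per attention layer; the depthwise-convolution overhead mentioned in the quote (2sC parameters) is not modeled here.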
“…Transformers [1], [2] have become a de-facto standard in deep learning and have been widely adopted in various fields. These models have been widely adopted in modern deep learning, such as natural language processing (NLP) [3], [4], [5], computer vision (CV) [6], [7], [8], [9], and speech processing [10], [11], [12], due to their ability to model longrange dependencies.…”
Section: Introduction
confidence: 99%
“…Currently, computer vision (CV) tasks are useful for solving problems related to object detection, classification, object counting, visual surveillance, etc., taking advantage of video resources from public surveillance cameras located in many public areas (i.e., shopping malls, supermarkets, airports, train stations, stadiums, etc.) [9][10][11][12]. The problem of the correct/incorrect wearing of face masks implies two CV tasks: (1) object detection and (2) object classification.…”
Section: Introduction
confidence: 99%