2019
DOI: 10.1016/j.neucom.2019.06.085
DAA: Dual LSTMs with adaptive attention for image captioning

Cited by 30 publications (9 citation statements)
References 9 publications
“…The developed technique yields better quality in the generated image captions. Fen Xiao et al. [32] developed an image captioning framework with dual LSTMs to enhance accessibility for blind people; in it, two separate LSTMs were integrated with an adaptive semantic attention mechanism.…”
Section: Contributions (mentioning)
confidence: 99%
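
The quoted description names the architecture only at a high level: two LSTMs plus adaptive semantic attention. As a rough sketch of how such a decoder can be wired together, here is a minimal PyTorch implementation under my own assumptions. It is not the authors' released code; all class, method, and parameter names are hypothetical, and the sentinel-style adaptive gate follows the common adaptive-attention formulation (a learned sentinel vector competes with the image regions in the attention softmax) rather than anything specified in the quote.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLSTMAdaptiveAttention(nn.Module):
    # Hypothetical sketch: LSTM 1 attends over image regions, LSTM 2
    # generates words; an adaptive "sentinel" gate lets the decoder fall
    # back on its language state for non-visual words ("of", "the", ...).
    def __init__(self, vocab, embed_dim=256, hid=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        self.v_proj = nn.Linear(feat_dim, hid)              # project CNN regions
        self.attn_lstm = nn.LSTMCell(embed_dim + hid, hid)  # LSTM 1 (attention)
        self.lang_lstm = nn.LSTMCell(hid + hid, hid)        # LSTM 2 (language)
        self.h_att = nn.Linear(hid, hid)
        self.s_att = nn.Linear(hid, hid)
        self.score = nn.Linear(hid, 1)
        self.gate = nn.Linear(embed_dim + 2 * hid, hid)     # sentinel gate
        self.out = nn.Linear(hid, vocab)

    def step(self, word, V, state):
        # word: (B,) token ids; V: (B, k, feat_dim) image region features.
        (h1, c1), (h2, c2) = state
        Vh = self.v_proj(V)                                 # (B, k, hid)
        x = torch.cat([self.embed(word), Vh.mean(1)], dim=1)
        h1, c1 = self.attn_lstm(x, (h1, c1))
        # Visual sentinel: a gated view of the first LSTM's memory cell.
        g = torch.sigmoid(self.gate(torch.cat([x, h1], dim=1)))
        s = g * torch.tanh(c1)                              # (B, hid)
        # Score the k regions plus the sentinel; softmax over k+1 candidates.
        q = self.h_att(h1).unsqueeze(1)                     # (B, 1, hid)
        e_v = self.score(torch.tanh(Vh + q)).squeeze(-1)    # (B, k)
        e_s = self.score(torch.tanh(self.s_att(s) + q.squeeze(1)))  # (B, 1)
        alpha = F.softmax(torch.cat([e_v, e_s], dim=1), dim=1)
        beta = alpha[:, -1:]                                # weight on sentinel
        ctx = (alpha[:, :-1].unsqueeze(-1) * Vh).sum(1) + beta * s
        h2, c2 = self.lang_lstm(torch.cat([ctx, h1], dim=1), (h2, c2))
        return self.out(h2), ((h1, c1), (h2, c2))

At decoding time both (h, c) pairs start at zero and step is called once per generated token; beta indicates how strongly each word was grounded in the image rather than in the language model.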
“…However, an RNN can only remember short-range information in a sequence. The special gated structure of the LSTM network [33] gives it the ability to memorize long-range information. RNN neurons store information in an uncontrolled form at each time step, whereas the LSTM network uses a dedicated gating mechanism to integrate and update information from the previous time step, effectively avoiding both exploding and vanishing gradients.…”
Section: Long and Short Term Memory Network (mentioning)
confidence: 99%
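
To make the gating mechanism described in the quoted passage concrete, here is a from-scratch NumPy sketch of one LSTM cell update. This is the standard LSTM formulation, not code from reference [33]; the weight layout and dimensions are assumptions chosen for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # x: (d_in,) input; h_prev, c_prev: (d_h,) previous hidden/cell state;
    # W: (4*d_h, d_in + d_h) stacked gate weights; b: (4*d_h,) biases.
    d = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0*d:1*d])   # forget gate: how much old memory to keep
    i = sigmoid(z[1*d:2*d])   # input gate: how much new content to write
    o = sigmoid(z[2*d:3*d])   # output gate: how much memory to expose
    g = np.tanh(z[3*d:4*d])   # candidate cell content
    # Additive cell update: gradients flow through c via elementwise
    # products with the gates instead of repeated matrix multiplications,
    # which is what mitigates exploding and vanishing gradients.
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

# Tiny usage example: run five time steps with random weights.
rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W = rng.normal(0.0, 0.1, size=(4 * d_h, d_in + d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c, W, b)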
“…Cross-modal learning aims to learn the relationships between different modalities. Significant progress has been observed in visual, audio, and language modality learning, including cross-modal retrieval [29,30,31], cross-modal matching [32,33], image captioning [34,35,36], visual question answering [37,38,39], video summarization [40,41,42], etc. This paper focuses on cross-modal learning between the audio and visual modalities.…”
Section: Related Work (mentioning)
confidence: 99%