Multi-Modal Image Captioning for the Visually Impaired
2021 · Preprint · DOI: 10.48550/arxiv.2105.08106

Abstract: One way blind people understand their surroundings is by taking pictures and relying on descriptions generated by image captioning systems. Current work on captioning images for the visually impaired does not use the textual data present in the image when generating captions. This gap is critical, as many visual scenes contain text, and up to 21% of the questions blind people ask about the images they take pertain to the text present in them (Bigham et al., 2010). In this work, we propose …

Cited by 2 publications (2 citation statements) · References 14 publications
“…The problem of learning small pieces of information in visual feature encoding is addressed by these regional feature encoding methods. [56] introduces a modification to the Attention on Attention Network (AoANet) image captioning model that uses text recognised in the image as an input feature. Additionally, when exact reproduction of tokens is required, they employ a pointer-generator mechanism to copy the detected text into the caption.…”
Section: Deep Learning Approaches for Image Captioning
Confidence: 99%
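
The pointer-generator mechanism mentioned in this citation statement can be illustrated compactly. Below is a minimal sketch of one decoding step, assuming a fixed vocabulary extended with extra slots for detected OCR tokens; the function name, the scalar p_gen gate, and the id layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def pointer_generator_step(vocab_logits, ocr_attention, ocr_token_ids, p_gen):
    """One decoding step of a pointer-generator copy mechanism (sketch).

    Mixes a generation distribution over the fixed vocabulary with a
    copy distribution over detected OCR tokens, so out-of-vocabulary
    scene text (e.g. a street sign) can be emitted verbatim.
    """
    vocab_size = len(vocab_logits)
    extended_size = vocab_size + len(ocr_token_ids)  # extra OCR slots

    # Generation path: probability mass only on the fixed vocabulary.
    final_dist = np.zeros(extended_size)
    final_dist[:vocab_size] = p_gen * softmax(vocab_logits)

    # Copy path: attention weights over OCR tokens become probability
    # mass on their extended-vocabulary ids.
    attn = softmax(ocr_attention)
    for weight, token_id in zip(attn, ocr_token_ids):
        final_dist[token_id] += (1.0 - p_gen) * weight

    return final_dist  # sums to 1 by construction

# Hypothetical example: a 5-word vocabulary plus two OCR tokens
# ("stop", "main") occupying extended ids 5 and 6.
dist = pointer_generator_step(
    vocab_logits=np.array([0.1, 2.0, 0.3, 0.0, -1.0]),
    ocr_attention=np.array([1.5, 0.2]),
    ocr_token_ids=[5, 6],
    p_gen=0.6,  # learned gate in the real model; fixed here for clarity
)
print(dist.argmax())  # id of the most likely next token
```

The soft gate p_gen lets the decoder interpolate between generating a common word and copying a detected token verbatim, which is exactly what is needed when a caption must reproduce scene text such as a sign or a label.
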
“…To ensure sustainable transportation, big-data analysis methods such as [1] and secure IoT data transmission technologies such as [2] are used to inform transportation planning. Deep learning also plays an indelible role here: natural-language description of traffic scenes helps visually impaired people in their daily lives and in participating in traffic [3,4], and it generates rich semantic information for drivers, supporting intelligent decision suggestions, reducing driver decision time, and lowering the risk of accidents [5]. This preserves both the resilience and the sustainability of traffic by ensuring road safety.…”
Section: Introduction
Confidence: 99%