Multi-Modal Image Captioning for the Visually Impaired
2021 · Preprint · DOI: 10.48550/arxiv.2105.08106

Abstract: One way blind people understand their surroundings is by taking pictures and relying on descriptions generated by image captioning systems. Current work on captioning images for the visually impaired does not use the textual data present in the image when generating captions. This gap is critical, as many visual scenes contain text, and up to 21% of the questions blind people ask about the images they take pertain to the text present in them (Bigham et al., 2010). In this work, we propose …

Cited by 2 publications (2 citation statements) · References 14 publications
“…The problem of learning small pieces of information in visual feature encoding is addressed by these regional feature encoding methods. [56] introduces a modification to the Attention on Attention Network (AoANet) image captioning model that uses text recognised in the image as an input feature. Additionally, when exact reproduction of tokens is required, they employ a pointer-generator mechanism to copy the detected text into the caption.…”
Section: Deep Learning Approaches for Image Captioning
Confidence: 99%
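
The pointer-generator mechanism mentioned in this citation statement can be illustrated compactly. Below is a minimal sketch of one decoding step, assuming a fixed vocabulary extended with extra slots for detected OCR tokens; the function name, the scalar p_gen gate, and the id layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def pointer_generator_step(vocab_logits, ocr_attention, ocr_token_ids, p_gen):
    """One decoding step of a pointer-generator copy mechanism (sketch).

    Mixes a generation distribution over the fixed vocabulary with a
    copy distribution over detected OCR tokens, so out-of-vocabulary
    scene text (e.g. a street sign) can be emitted verbatim.
    """
    vocab_size = len(vocab_logits)
    extended_size = vocab_size + len(ocr_token_ids)  # extra OCR slots

    # Generation path: probability mass only on the fixed vocabulary.
    final_dist = np.zeros(extended_size)
    final_dist[:vocab_size] = p_gen * softmax(vocab_logits)

    # Copy path: attention weights over OCR tokens become probability
    # mass on their extended-vocabulary ids.
    attn = softmax(ocr_attention)
    for weight, token_id in zip(attn, ocr_token_ids):
        final_dist[token_id] += (1.0 - p_gen) * weight

    return final_dist  # sums to 1 by construction

# Hypothetical example: a 5-word vocabulary plus two OCR tokens
# ("stop", "main") occupying extended ids 5 and 6.
dist = pointer_generator_step(
    vocab_logits=np.array([0.1, 2.0, 0.3, 0.0, -1.0]),
    ocr_attention=np.array([1.5, 0.2]),
    ocr_token_ids=[5, 6],
    p_gen=0.6,  # learned gate in the real model; fixed here for clarity
)
print(dist.argmax())  # id of the most likely next token
```

The soft gate p_gen lets the decoder interpolate between generating a common word and copying a detected token verbatim, which is exactly what is needed when a caption must reproduce scene text such as a sign or a label.
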
“…To ensure sustainable transportation, big-data analysis methods such as [1] and secure IoT data transmission technologies such as [2] are used to inform transportation planning. Deep learning also plays an indelible role here: natural-language description of traffic scenes helps visually impaired people in their daily lives and in participating in traffic [3,4], and it generates rich semantic information for drivers, supporting intelligent decision suggestions, reducing driver decision time, and lowering the risk of accidents [5]. This preserves both the resilience and the sustainability of traffic by ensuring road safety.…”
Section: Introduction
Confidence: 99%