2021
DOI: 10.1016/j.neucom.2020.10.053

Audio description from image by modal translation network

Abstract: Audio is the main form for the visually impaired to obtain information. In reality, all kinds of visual data always exist, but audio data does not exist in many cases. In order to help the visually impaired people to better perceive the information around them, an image-to-audio-description (I2AD) task is proposed to generate audio descriptions from images in this paper. To complete this totally new task, a modal translation network (MT-Net) from visual to auditory sense is proposed. The proposed MT-Net includ…

Cited by 17 publications (3 citation statements)
References 53 publications
“…However, these methods are not suitable for large-scale RS image retrieval since they are based on low-level hand-crafted features. With the rapid development of artificial intelligent technology [22]-[26], a large number of uni-modal RS data retrieval methods based on deep learning have emerged. For instance, Tang et al. [27] develop a two-stage re-ranking method to improve the retrieval performance.…”
Section: A Uni-modal RS Data Retrieval Methods (mentioning)
confidence: 99%
“…In [28], the authors proposed a modal translation network (MT-Net) from visual to auditory senses to generate audio descriptions from images. They demonstrated that the proposed model could generate intelligible audio descriptions from visual images to a good extent.…”
Section: Deep Learning For Visual-auditory Conversion (mentioning)
confidence: 99%
“…With the high progress of remote sensing (RS) technology, RS images have shown a high-speed growth trend [1], [2]. Unearthing serviceable information from large-scale RS images is very critical [3], [4]. Hence, many researchers pay attention to the research of remote sensing image retrieval (RSIR) because RSIR can quickly find effective information from large-scale RS images [5], [6].…”
Section: Introduction (mentioning)
confidence: 99%