2021
DOI: 10.1109/access.2021.3062752

Multimodal Corpus Design for Audio-Visual Speech Recognition in Vehicle Cabin

Abstract: This paper introduces a new methodology for driver-comfortable, in-the-wild multimodal corpus creation for audio-visual speech recognition in driver monitoring systems. The presented methodology is universal and can be used for corpus recording in different languages. We present an analysis of speech recognition systems and voice interfaces for driver monitoring systems based on both audio and video data. Multimodal speech recognition allows using audio data when video data are useless …

Cited by 21 publications (11 citation statements).
References 42 publications (48 reference statements).
“…Chen et al (2022) have proposed improved K-singular value decomposition and atom optimization techniques to reduce image noise. The authors have developed an audio-visual speech recognition scheme for a driver monitoring system (Kashevnik et al, 2021). Multimodal speech recognition allows for the use of audio data when video data are unavailable at night, as well as the use of video data in acoustically loud environments such as highways.…”
Section: Literature Review
confidence: 99%
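The complementary use of the two modalities described in the excerpt above can be illustrated with a confidence-weighted late-fusion sketch. The function name and the weighting heuristics below (an SNR-based audio weight and an illumination-based video weight) are illustrative assumptions, not the cited system's actual fusion method.

```python
# Minimal sketch of confidence-weighted late fusion for audio-visual
# speech recognition. Purely illustrative: the weighting scheme is an
# assumption, not the approach of Kashevnik et al. (2021).
import numpy as np

def fuse_hypotheses(audio_probs: np.ndarray,
                    video_probs: np.ndarray,
                    audio_snr_db: float,
                    illumination: float) -> int:
    """Pick the most likely phrase from two class posteriors,
    down-weighting whichever modality is degraded."""
    # Heuristic reliability weights: low SNR penalizes audio (highway noise),
    # low illumination penalizes video (night-time driving).
    w_audio = np.clip(audio_snr_db / 30.0, 0.0, 1.0)
    w_video = np.clip(illumination, 0.0, 1.0)
    total = w_audio + w_video + 1e-8
    fused = (w_audio * audio_probs + w_video * video_probs) / total
    return int(np.argmax(fused))

# Example: noisy highway cabin at night -> both weights are small,
# so the fused decision leans on whichever modality is less degraded.
audio_p = np.array([0.1, 0.7, 0.2])   # posterior over 3 candidate phrases
video_p = np.array([0.3, 0.3, 0.4])
print(fuse_hypotheses(audio_p, video_p, audio_snr_db=5.0, illumination=0.1))
```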
“…Another recent trend is web-based datasets: datasets collected from open sources such as YouTube or TV shows [59]. The most well-known of them are the LRW dataset [20], the LRS2-BBC and LRS3-TED datasets [63], VGG-SOUND [64], the Modality dataset [65], and the vehicle AVSR corpus [66]. A survey [67] on this topic provides essential knowledge of the current state of the art.…”
Section: Related Work
confidence: 99%
“…Almost no publicly accessible audio-visual Russian speech datasets are available and suitable for NN training. The most recent one was introduced in [1] and was specifically designed for the task of robust speech recognition in an acoustically noisy car environment.…”
Section: Data and Preprocessing
confidence: 99%
“…In this paper, we present a lip-reading pipeline and an acoustic speech recognition pipeline built on deep 3D CNNs. We trained and evaluated our models on the RUSAVIC [1] dataset with a limited vocabulary of 50 phrases. To handle the over-fitting problem caused by the increased number of parameters from the 3D kernels, we applied the idea from [2] to inflate the pre-trained weights of several state-of-the-art models, such as MobileNetV2 [3], DenseNet121 [4], and NASNetMobile [5].…”
Section: Introduction
confidence: 99%
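The weight inflation mentioned in the excerpt above (replicating image-pretrained 2D kernels along the temporal axis, as in [2]) can be sketched as follows. This is a minimal PyTorch illustration under assumed layer shapes; it is not the exact procedure or architecture used in the cited work.

```python
# Minimal sketch of I3D-style weight inflation: a 2D convolution kernel
# pre-trained on images is replicated along the temporal axis and rescaled,
# so a 3D CNN for lip-reading can start from image-pretrained weights.
# Layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Build a Conv3d whose weights are the Conv2d weights repeated
    over the temporal dimension and divided by time_dim."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # Repeat the 2D kernel along time and rescale so the response to a
        # temporally constant input matches the original 2D response.
        w2d = conv2d.weight.data                              # (out, in, kH, kW)
        w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias.data)
    return conv3d

# Example: inflate a stem convolution of the kind found in image CNNs
# (e.g., a MobileNetV2-like first layer) and run it on a clip of 16 frames.
conv2d = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False)
conv3d = inflate_conv2d(conv2d, time_dim=3)
x = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, H, W)
print(conv3d(x).shape)
```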