2021
DOI: 10.1109/access.2021.3062752

Multimodal Corpus Design for Audio-Visual Speech Recognition in Vehicle Cabin

Abstract: This paper introduces a new methodology for driver-comfortable, in-the-wild multimodal corpus creation for audio-visual speech recognition in driver monitoring systems. The presented methodology is universal and can be used for corpus recording in different languages. We present an analysis of speech recognition systems and voice interfaces for driver monitoring systems based on both audio and video data. Multimodal speech recognition allows using audio data when video data are useless …

Cited by 21 publications (11 citation statements).
References 42 publications (48 reference statements).
“…Chen et al (2022) have proposed improved K-singular value decomposition and atom optimization techniques to reduce image noise. The authors have developed an audio-visual speech recognition scheme for a driver monitoring system (Kashevnik et al, 2021). Multimodal speech recognition allows for the use of audio data when video data are unavailable at night, as well as the use of video data in acoustically loud environments such as highways.…”
Section: Literature Review
confidence: 99%
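The complementary use of the two modalities described in the excerpt above can be illustrated with a confidence-weighted late-fusion sketch. The function name and the weighting heuristics below (an SNR-based audio weight and an illumination-based video weight) are illustrative assumptions, not the cited system's actual fusion method.

```python
# Minimal sketch of confidence-weighted late fusion for audio-visual
# speech recognition. Purely illustrative: the weighting scheme is an
# assumption, not the approach of Kashevnik et al. (2021).
import numpy as np

def fuse_hypotheses(audio_probs: np.ndarray,
                    video_probs: np.ndarray,
                    audio_snr_db: float,
                    illumination: float) -> int:
    """Pick the most likely phrase from two class posteriors,
    down-weighting whichever modality is degraded."""
    # Heuristic reliability weights: low SNR penalizes audio (highway noise),
    # low illumination penalizes video (night-time driving).
    w_audio = np.clip(audio_snr_db / 30.0, 0.0, 1.0)
    w_video = np.clip(illumination, 0.0, 1.0)
    total = w_audio + w_video + 1e-8
    fused = (w_audio * audio_probs + w_video * video_probs) / total
    return int(np.argmax(fused))

# Example: noisy highway cabin at night -> both weights are small,
# so the fused decision leans on whichever modality is less degraded.
audio_p = np.array([0.1, 0.7, 0.2])   # posterior over 3 candidate phrases
video_p = np.array([0.3, 0.3, 0.4])
print(fuse_hypotheses(audio_p, video_p, audio_snr_db=5.0, illumination=0.1))
```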
“…Another recent trend is web-based datasets: datasets collected from open sources such as YouTube or TV shows [59]. The most well-known of them are the LRW dataset [20], the LRS2-BBC and LRS3-TED datasets [63], VGG-SOUND [64], the Modality dataset [65], and the vehicle AVSR corpus [66]. A survey [67] on this topic provides essential knowledge of the current state of the art.…”
Section: Related Work
confidence: 99%
“…Almost no publicly accessible audio-visual Russian speech datasets are available and suitable for NN training. The most recent one was introduced in [1] and was specifically designed for the task of robust speech recognition in an acoustically noisy car environment.…”
Section: Data and Preprocessing
confidence: 99%
“…In this paper, we present a lip-reading pipeline and an acoustic speech recognition pipeline built on deep 3D CNNs. We trained and evaluated our models on the RUSAVIC [1] dataset with a limited vocabulary of 50 phrases. To handle the over-fitting problem caused by the increased number of parameters from the 3D kernels, we applied the idea from [2] to inflate the pre-trained weights of several state-of-the-art models, such as MobileNetV2 [3], DenseNet121 [4], and NASNetMobile [5].…”
Section: Introduction
confidence: 99%
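The weight inflation mentioned in the excerpt above (replicating image-pretrained 2D kernels along the temporal axis, as in [2]) can be sketched as follows. This is a minimal PyTorch illustration under assumed layer shapes; it is not the exact procedure or architecture used in the cited work.

```python
# Minimal sketch of I3D-style weight inflation: a 2D convolution kernel
# pre-trained on images is replicated along the temporal axis and rescaled,
# so a 3D CNN for lip-reading can start from image-pretrained weights.
# Layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Build a Conv3d whose weights are the Conv2d weights repeated
    over the temporal dimension and divided by time_dim."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # Repeat the 2D kernel along time and rescale so the response to a
        # temporally constant input matches the original 2D response.
        w2d = conv2d.weight.data                              # (out, in, kH, kW)
        w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias.data)
    return conv3d

# Example: inflate a stem convolution of the kind found in image CNNs
# (e.g., a MobileNetV2-like first layer) and run it on a clip of 16 frames.
conv2d = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False)
conv3d = inflate_conv2d(conv2d, time_dim=3)
x = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, H, W)
print(conv3d(x).shape)
```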