2022
DOI: 10.48550/arxiv.2203.07996
Preprint

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Abstract: Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled multimodal data is rather costly, especially for audio-visual speech recognition (AVSR). It therefore makes sense to exploit unlabelled unimodal data. On the other hand, although the effectiveness of large-scale self-supervised learning is well established in both the audio and visual modalities, how to integrate those pretrained models into a multimodal scenario remains underexplored. In th…

Cited by 3 publications (10 citation statements)
References 22 publications
“…In Ref. [117] (No. 9, Table 3), the authors achieved 85.00% lip-reading accuracy by making use of unlabelled unimodal data.…”
Section: LRW Methodology: Visual Speech Recognition
Mentioning confidence: 99%
“…The CTC loss assumes conditional independence between each output prediction and has a form of [117] p_CTC(y…”
Section: LRW Methodology: Visual Speech Recognition
Mentioning confidence: 99%
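
For context, the quote above refers to the standard CTC factorization. A minimal sketch in the usual notation (the symbols here are assumed for illustration, not copied from the cited paper):

p_{\mathrm{CTC}}(\mathbf{y} \mid \mathbf{x}) \;=\; \sum_{\boldsymbol{\pi} \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} p(\pi_t \mid \mathbf{x})

where \mathcal{B} collapses repeated symbols and removes blanks, so the sum runs over all frame-level alignments \boldsymbol{\pi} of the label sequence \mathbf{y}, and each per-frame prediction \pi_t is treated as conditionally independent given the input \mathbf{x}.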
“…The visual frontend serves as a component that captures lip motion and reflects lip-position differences in its output representations. Here, we have followed the same procedure as Xichen Pan [21]: we truncated the first convolutional layer of MoCo v2, which was pre-trained on ImageNet, and replaced it with a 3D convolutional layer.…”
Section: Architecture and Methods
Mentioning confidence: 99%
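
To make the frontend modification concrete, below is a hypothetical PyTorch sketch of truncating the 2D input stem of a ResNet-50 backbone (the architecture used by MoCo v2) and replacing it with a 3D convolutional stem. The class name, kernel sizes, input shapes, and weight handling are assumptions for illustration, not taken from either paper; in practice the trunk would be initialized from a MoCo v2 checkpoint rather than left random.

# Hypothetical sketch: swap the 2D stem of a ResNet-50 backbone for a 3D stem.
import torch
import torch.nn as nn
import torchvision

class VisualFrontend(nn.Module):
    def __init__(self):
        super().__init__()
        # 3D stem over greyscale lip crops: (B, 1, T, H, W) -> (B, 64, T, H/4, W/4)
        self.stem3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # ResNet-50 trunk; load MoCo v2 ImageNet-pretrained weights here in practice.
        resnet = torchvision.models.resnet50(weights=None)
        # Drop the original conv1/bn1/relu/maxpool stem and the classifier head,
        # keeping layer1..layer4 and the global average pool.
        self.trunk = nn.Sequential(*list(resnet.children())[4:-1])

    def forward(self, x):                             # x: (B, 1, T, H, W)
        b, _, t, _, _ = x.shape
        feats = self.stem3d(x)                        # (B, 64, T, H', W')
        feats = feats.transpose(1, 2).flatten(0, 1)   # (B*T, 64, H', W')
        feats = self.trunk(feats)                     # (B*T, 2048, 1, 1)
        return feats.flatten(1).view(b, t, -1)        # (B, T, 2048) per-frame features

frontend = VisualFrontend()
out = frontend(torch.randn(2, 1, 16, 112, 112))
print(out.shape)  # torch.Size([2, 16, 2048])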
“…For audio-only methods, we used the same LU-SSL transformer proposed by Xichen Pan et al. [21] in 2022, so the WER is consistent with that method: an error rate of only 2.7%, which is currently the best achieved on the LRS2 dataset.…”
Section: Methods
Mentioning confidence: 99%