Modern speech recognition systems achieve over 97% accuracy on several benchmark data sets, but their accuracy degrades sharply in noisy environments, and improving recognition performance under noise remains a challenging task. Because visual information is unaffected by acoustic noise, researchers often use lip information to aid speech recognition, which makes lip-reading performance and the quality of cross-modal fusion particularly important. In this paper, we improve speech recognition accuracy in noisy environments by improving both lip-reading performance and cross-modal fusion. First, because the same lip movement may correspond to multiple sounds, we construct a one-to-many mapping between lip movements and speech, allowing the lip-reading model to consider which articulations an input lip movement could plausibly represent. In addition, audio representations are preserved in a memory by modeling the relationships between paired audio-visual representations; at the inference stage, the preserved audio representations can be retrieved from this memory through the learned relationships using video input alone. Second, a joint cross-fusion model based on the attention mechanism effectively exploits complementary inter-modal relationships by computing cross-attention weights from the correlations between joint feature representations and the individual modalities. Finally, our proposed model reduces the word error rate (WER) by 4.0% at −15 dB SNR compared with the baseline method, and by 10.1% compared with audio-only speech recognition. The experimental results show that our method significantly outperforms audio-only speech recognition models across different noise conditions.
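To make the memory mechanism concrete, the following is a minimal sketch of how a key-value memory that preserves audio representations and is addressed by visual features might look. The class name `VisualAudioMemory`, the slot count, and the feature dimensions are illustrative assumptions, not the paper's exact architecture; during training, the recalled representation would be pulled toward the paired audio representation (e.g., by a reconstruction loss), so that at inference audio-like features can be recovered from video alone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAudioMemory(nn.Module):
    """Sketch of a key-value memory: visual features softly address learned
    keys, and the weighted sum of value slots recalls a stored audio-like
    representation. Slot count and dimensions are assumptions."""

    def __init__(self, num_slots: int = 128, dim: int = 256):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim))    # addressed by video
        self.values = nn.Parameter(torch.randn(num_slots, dim))  # hold audio info

    def forward(self, video_feat: torch.Tensor) -> torch.Tensor:
        # video_feat: (batch, time, dim)
        # Soft addressing: similarity between video features and memory keys.
        addr = F.softmax(video_feat @ self.keys.t(), dim=-1)     # (B, T, slots)
        # Weighted sum over value slots recalls an audio representation,
        # so audio information is recovered from video input alone.
        return addr @ self.values                                # (B, T, dim)
```

Because the addressing is a soft distribution over multiple slots, a single lip movement can recall a mixture of stored audio representations, which is one way to realize the one-to-many lip-to-speech mapping described above.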
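The joint cross-attention fusion can be sketched in the same spirit. Assuming frame-aligned audio and video features of equal dimension, the snippet below forms a joint (concatenated) representation, computes per-modality attention weights from its correlation with each modality, and fuses the attended features; `JointCrossAttention` and its projections are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCrossAttention(nn.Module):
    """Sketch of joint cross-attention fusion: attention weights for each
    modality come from correlating a joint audio-visual representation
    with that modality's own features. Dimensions are assumptions."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Project the joint representation into each modality's feature space.
        self.proj_a = nn.Linear(2 * dim, dim, bias=False)
        self.proj_v = nn.Linear(2 * dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio, video: (batch, time, dim), assumed temporally aligned.
        joint = torch.cat([audio, video], dim=-1)                # (B, T, 2*dim)
        # Cross-attention weights from joint-vs-modality correlations.
        attn_a = F.softmax(self.proj_a(joint) @ audio.transpose(1, 2) * self.scale, dim=-1)
        attn_v = F.softmax(self.proj_v(joint) @ video.transpose(1, 2) * self.scale, dim=-1)
        # Attend each modality with its joint-conditioned weights, then fuse.
        return attn_a @ audio + attn_v @ video                   # (B, T, dim)

# Example usage with random features (hypothetical shapes)
fusion = JointCrossAttention(dim=256)
a = torch.randn(2, 50, 256)   # audio features
v = torch.randn(2, 50, 256)   # video features
fused = fusion(a, v)          # (2, 50, 256)
```

Conditioning the attention weights on the joint representation, rather than on either modality alone, is what lets the fusion exploit complementary inter-modal relationships: each modality is re-weighted by how well it agrees with the combined audio-visual evidence.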