Improving Speech Recognition Performance in Noisy Environments by Enhancing Lip Reading Accuracy

Li, Dengshi; Gao, Yu; Zhu, Chenyi; Wang, Qianrui; Wang, Ruoxi

doi:10.3390/s23042053

Cited by 10 publications

(5 citation statements)

References 33 publications

“…For example, individual differences in co-articulation may underlie difficulties in the transfer of training of visual speech recognition (Bear & Harvey, 2017). Future interactions between psychologists and computer scientists studying multimodal speech recognition could facilitate identification of specific targets for training people and/or computers, perhaps even leading to a new generation of 'smart' hearing aids that use lipreading to enhance automatic speech recognition in noisy environments (e.g., Li et al, 2023).…”

Section: Discussion (mentioning)

confidence: 99%

Assessing Visual Speech Communication Abilities: Cross-situational Consistency in Lipreadability

Myerson, Spehar, Strube et al. 2024

Preprint

Objective: Does the accuracy with which an individual can receive visual speech information reflect one psychometric ability while the accuracy with which they can transmit visual speech information reflect a different ability? This question has not been directly addressed in previous studies, but if so, it would have important theoretical, methodological, and clinical implications because of the interaction between the accuracy of auditory and visual speech recognition in noisy situations. From a psychometric perspective, an ability is distinguished by a high degree of cross-situational consistency in the pattern of individual differences. Therefore, the present investigation focused on whether individuals were consistently good or bad at recognizing vision-only speech as well as whether they were consistently good or bad at accurately transmitting visual speech information.

Design: Round-robin experimental designs, in which each participant in a group lipreads everyone else in their group, provide an efficient way to simultaneously measure individual differences in the consistency in individuals’ lipreading accuracy across talkers as well as the consistency with which different senders (i.e., talkers) transmitted accurate speech information to different receivers (i.e., lipreaders). Accordingly, the present investigation analyzed data from two groups in a round-robin study in order to assess the degree of consistency in both lipreading ability (i.e., the accuracy with which different participants lipread others in their group) and lipreadability (i.e., how accurately individual talkers could be lipread by the others in their group).

Results: In both groups, very strong correlations (mean rs = .867 and .897) among the accuracy with which different individuals lipread different talkers demonstrated that the same individual participants were consistently good (or poor) at lipreading regardless of who the talker was. Consistent with the hypothesis that the ability to transmit visual speech information is also a psychometric ability, additional strong correlations (mean rs = .645 and .842) revealed that in both groups the same individual talkers were consistently lipread accurately or inaccurately regardless of who was doing the lipreading. There was no evidence that participants’ lipreadability (production) was related to their lipreading ability (comprehension).

Conclusions: The present findings show that the effectiveness of visual speech communication depends on two separate psychometric abilities: receivers’ lipreading ability and senders’ lipreadability. Together, these abilities determine the accuracy with which speech information is communicated from senders to receivers, particularly in noisy situations, and the degree of communication possible between a specific sender and a specific receiver of speech information. In other words, in some situations what people describe as ‘hearing problems’ might be better described as problems with audiovisual communication. Identification and assessment of the specific nature of these problems may make it possible to more accurately target and potentially remediate the communication problems people experience in everyday life.

“… Forensics: lip-reading can be used to reconstruct the dialogues in a footage where the audio has been lost or it is noisy.  Automated Speech Recognition [1]: automakers can integrate lip-reading systems to complement their ASR model in order to understand commands (for example "turn on the A/C") from the driver or the passengers in smart cars when the music's volume is too high. Lip-reading is also necessary in this case to recognize the active speaker in the scene.…”

Section: Introduction (mentioning)

confidence: 99%

Biglip: A Pipeline for Building Data Sets for Lip-Reading

Jamil 2024

Machine Learning, IOT and Blockchain

Lip-reading, the process of deciphering text from visual mouth movements, has garnered significant research attention. While numerous data sets exist for training lip-reading models, their coverage of diverse languages remains limited. In this paper, we introduce an innovative pipeline for constructing data sets tailored to lipreading models, leveraging web-based videos. Notably, this pipeline is the first of its kind to be made publicly available. By employing this pipeline, we successfully compiled a data set comprising Italian videos—a previously unexplored language for lipreading research. Subsequently, we utilized this data set to train two lip-reading models, thereby highlighting the strengths and weaknesses of employing wildsourced videos (e.g., from YouTube) for lip-reading model training. The proposed pipeline encompasses modules for audio-video synchronization, audio transcription, alignment, cleaning, and facilitates the creation of extensive training data with minimal supervision. By presenting this pipeline, we aim to encourage further advancements in lip-reading research, specifically in the domain of multilingual data sets, thus fostering more comprehensive and inclusive lip-reading models.

“…Multimodal deep learning has emerged as a powerful approach for various tasks by combining information from different modalities, exploiting their complementary nature, and enhancing their overall performance [1][2][3][4]. In the realm of speaker recognition, incorporating multiple features, such as lip movements, depth images, and voice, can lead to improved accuracy and robustness in applications such as security systems, access control, and surveillance [2,[5][6][7][8][9].…”

Section: Introduction (mentioning)

confidence: 99%

Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification

Moufidi, Rousseau, Rasti 2023

Sensors

Multimodal deep learning, in the context of biometrics, encounters significant challenges due to the dependence on long speech utterances and RGB images, which are often impractical in certain situations. This paper presents a novel solution addressing these issues by leveraging ultrashort voice utterances and depth videos of the lip for person identification. The proposed method utilizes an amalgamation of residual neural networks to encode depth videos and a Time Delay Neural Network architecture to encode voice signals. In an effort to fuse information from these different modalities, we integrate self-attention and engineer a noise-resistant model that effectively manages diverse types of noise. Through rigorous testing on a benchmark dataset, our approach exhibits superior performance over existing methods, resulting in an average improvement of 10%. This method is notably efficient for scenarios where extended utterances and RGB images are unfeasible or unattainable. Furthermore, its potential extends to various multimodal applications beyond just person identification.
