Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face

Tong, Shan; Wenner, Casper E; Xu, Chenliang; Duan, Zhiyao; Maddox, Ross K.

doi:10.1177/23312165221136934

Cited by 5 publications

(12 citation statements)

References 53 publications


“…Our study replicates decades of research by showing that seeing the face of a real talker improves speech-in-noise perception (Peelle and Sommers, 2015; Sumby and Pollack, 1954). Our study also confirms two recent reports that viewing a synthetic face generated by a deep neural network (DNN) significantly improves speech-in-noise perception (Shan et al, 2022; Varano et al, 2022). Both the present study and these previous reports found that the improvement from viewing DNN faces was only about half that provided by viewing real faces.…”

Section: Discussion (supporting)

confidence: 91%

“…For comparison, the same pairing with a real visual face evoked the percept of /v/ on 94% of trials (Dias et al, 2016; Shahin, 2019). Taken together, this indicates that for incongruent auditory-visual speech, synthetic faces influenced perception much less than real faces, consistent with the real−synthetic difference for speech-in-noise observed in the present study and (Shan et al, 2022; Varano et al, 2022).…”

Section: Discussion (supporting)

confidence: 90%

“…In order to maximize the number of tested words and minimize experimental time, only a single noise level was tested, as in a previous study of DNN faces (Varano et al, 2022), with a high level of noise selected to maximize the benefit of visual speech (Rennig et al, 2020). Another previous study of DNN faces tested multiple noise levels and found a lawful relationship between different noise levels and perception (Shan et al, 2022). As the amount of added auditory noise decreased, accuracy increased for the no-face, real face and DNN face conditions in parallel, converging at ceiling accuracy for all three conditions when no auditory noise was added.…”

Section: Limitations of the Present Study (mentioning)

confidence: 99%


The Effect on Speech-in-Noise Perception of Real Faces and Synthetic Faces Generated with either Deep Neural Networks or the Facial Action Coding System

Yu, Lado, Zhang et al. 2024

Preprint


The prevalence of synthetic talking faces in both commercial and academic environments is increasing as the technology to generate them grows more powerful and available. While it has long been known that seeing the face of the talker improves human perception of speech-in-noise, recent studies have shown that synthetic talking faces generated by deep neural networks (DNNs) are also able to improve human perception of speech-in-noise. However, in previous studies the benefit provided by DNN synthetic faces was only about half that of real human talkers. We sought to determine whether synthetic talking faces generated by an alternative method would provide a greater perceptual benefit. The facial action coding system (FACS) is a comprehensive system for measuring visually discernible facial movements. Because the action units that comprise FACS are linked to specific muscle groups, synthetic talking faces generated by FACS might have greater verisimilitude than DNN synthetic faces, which do not reference an explicit model of the facial musculature. We tested the ability of human observers to identify speech-in-noise accompanied by a blank screen; the real face of the talker; and synthetic talking faces generated either by DNN or FACS. We replicated previous findings of a large benefit for seeing the face of a real talker for speech-in-noise perception and a smaller benefit for DNN synthetic faces. FACS faces also improved perception, but only to the same degree as DNN faces. Analysis at the phoneme level showed that the performance of DNN and FACS faces was particularly poor for phonemes that involve interactions between the teeth and lips, such as /f/, /v/, and /th/. Inspection of single video frames revealed that the characteristic visual features for these phonemes were weak or absent in synthetic faces. Modeling the real vs. synthetic difference showed that increasing the realism of a few phonemes could substantially increase the overall perceptual benefit of synthetic faces, providing a roadmap for improving communication in this rapidly developing domain.



The Effect on Speech-in-Noise Perception of Real Faces and Synthetic Faces Generated with either Deep Neural Networks or the Facial Action Coding System

Yu, Lado, Zhang et al. 2024

Preprint


“…The ability to rapidly generate a synthetic face saying arbitrary words suggests the possibility of an “audiovisual hearing aid” that displays a synthetic talking face to improve comprehension. This possibility received support from two recent studies that used deep neural networks (DNNs) to generate realistic, synthetic talking faces (Shan et al, 2022; Varano et al, 2022). Both studies found that viewing synthetic faces significantly improved speech-in-noise perception, but the benefit was only about half as much as viewing a real human talker.…”

Section: Introduction (mentioning)

confidence: 99%

“…To test this idea, we undertook a behavioral study to compare the perception of speech-in-noise on its own; speech-in-noise with real faces (to serve as a benchmark); and speech-in-noise presented with two types of synthetic faces. The first synthetic face type was generated by a deep neural network, as in the studies of (Shan et al, 2022; Varano et al, 2022). The second synthetic face type was generated using FACS, as implemented in the commercial software package JALI (Edwards et al, 2016; Zhou et al, 2018).…”

Section: Introduction (mentioning)

confidence: 99%

Synthetic faces generated with the facial action coding system or deep neural networks improve speech-in-noise perception, but not as much as real faces

Yu, Lado, Zhang et al. 2024

Front. Neurosci.


The prevalence of synthetic talking faces in both commercial and academic environments is increasing as the technology to generate them grows more powerful and available. While it has long been known that seeing the face of the talker improves human perception of speech-in-noise, recent studies have shown that synthetic talking faces generated by deep neural networks (DNNs) are also able to improve human perception of speech-in-noise. However, in previous studies the benefit provided by DNN synthetic faces was only about half that of real human talkers. We sought to determine whether synthetic talking faces generated by an alternative method would provide a greater perceptual benefit. The facial action coding system (FACS) is a comprehensive system for measuring visually discernible facial movements. Because the action units that comprise FACS are linked to specific muscle groups, synthetic talking faces generated by FACS might have greater verisimilitude than DNN synthetic faces, which do not reference an explicit model of the facial musculature. We tested the ability of human observers to identify speech-in-noise accompanied by a blank screen; the real face of the talker; and synthetic talking faces generated either by DNN or FACS. We replicated previous findings of a large benefit for seeing the face of a real talker for speech-in-noise perception and a smaller benefit for DNN synthetic faces. FACS faces also improved perception, but only to the same degree as DNN faces. Analysis at the phoneme level showed that the performance of DNN and FACS faces was particularly poor for phonemes that involve interactions between the teeth and lips, such as /f/, /v/, and /th/. Inspection of single video frames revealed that the characteristic visual features for these phonemes were weak or absent in synthetic faces. Modeling the real vs. synthetic difference showed that increasing the realism of a few phonemes could substantially increase the overall perceptual benefit of synthetic faces.


The Noisy Encoding of Disparity Model Predicts Perception of the McGurk Effect in Native Japanese Speakers

Magnotti, Lado, Beauchamp 2024

Preprint


The McGurk effect is an illusion that demonstrates the influence of information from the face of the talker on the perception of auditory speech. The diversity of human languages has prompted many intercultural studies of the effect, including in native Japanese speakers. Studies of large samples of native English speakers have shown that the McGurk effect is characterized by high variability, both in the susceptibility of different individuals to the illusion and in the frequency with which different experimental stimuli induce the illusion. The noisy encoding of disparity (NED) model of the McGurk effect uses Bayesian principles to account for this variability by separately estimating the susceptibility and sensory noise for each individual and the strength of each stimulus. To test whether the NED model could account for McGurk perception in a non-Western culture, we applied it to data collected from 80 native Japanese-speaking participants. Fifteen different McGurk stimuli were presented, along with audiovisual congruent stimuli. The McGurk effect was highly variable across stimuli and participants, with the percentage of illusory fusion responses ranging from 3% to 78% across stimuli and from 0% to 91% across participants. Despite this variability, the NED model accurately predicted perception, predicting fusion rates for individual stimuli with 2.1% error and for individual participants with 2.4% error. Stimuli containing the unvoiced pa/ka pairing evoked more fusion responses than the voiced ba/ga pairing. Model estimates of sensory noise were correlated with participant age, with greater sensory noise in older participants. The NED model of the McGurk effect offers a principled way to account for individual and stimulus differences when examining the McGurk effect within and across cultures.

