Abstract

We designed and trained a modified time-delay neural network (TDNN) to perform automatic lipreading ("speech reading") in conjunction with acoustic speech recognition, in order to improve recognition both in silent environments and in the presence of acoustic noise. The speech-reading subsystem has a speaker-independent recognition accuracy of 51% (in the absence of acoustic information); the combined acoustic-visual system has a recognition accuracy of 91%, all on a ten-utterance speaker-independent task. Most importantly, with no free parameters, our system is far more robust to acoustic noise and verbal distractors than is a system not incorporating visual information. Specifically, in the presence of high-amplitude pink noise, the low recognition rate of the acoustic-only system (43%) is raised dramatically to 75% by the incorporation of visual information. Additionally, our system responds to (artificial) conflicting cross-modal patterns in a way closely analogous to the McGurk effect in humans. We thus demonstrate the power of neural techniques in several crucial and difficult domains: 1) pattern recognition, 2) sensory integration, and 3) distributed approaches toward "rule-based" (linguistic-phonological) processing. Our results suggest that speech-reading systems may find use in a vast array of real-world situations, for instance high-noise environments such as factory and shop floors, cockpits, large office environments, outdoor public spaces, and so on.
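To make the two ingredients of the abstract concrete, the sketch below shows (a) a time-delay layer, i.e. one shared weight window slid over input frames, which is the defining operation of a TDNN, and (b) a parameter-free fusion step that multiplies the per-class probabilities from the acoustic and visual channels and renormalizes. This is a minimal illustration under stated assumptions: the layer sizes, delay window, feature dimensions, temporal averaging, and the multiplicative fusion rule are all illustrative choices consistent with "no free parameters", not the paper's exact architecture.

import numpy as np

def tdnn_layer(x, weights, bias):
    """One TDNN layer: the same weight window applied at every time shift.

    x       : (T, d_in)            input feature frames
    weights : (delay, d_in, d_out) shared weights over `delay` frames
    bias    : (d_out,)
    returns : (T - delay + 1, d_out)
    """
    delay, d_in, d_out = weights.shape
    T = x.shape[0]
    out = np.empty((T - delay + 1, d_out))
    for t in range(T - delay + 1):
        window = x[t:t + delay]  # (delay, d_in) slice of frames
        out[t] = np.tensordot(window, weights, axes=([0, 1], [0, 1])) + bias
    return np.tanh(out)

def class_scores(x, layers):
    """Stack TDNN layers, then average activations over time per class."""
    for w, b in layers:
        x = tdnn_layer(x, w, b)
    return x.mean(axis=0)  # (n_classes,)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse(acoustic_scores, visual_scores):
    """Assumed parameter-free fusion: multiply the two modalities'
    class probabilities and renormalize."""
    p = softmax(acoustic_scores) * softmax(visual_scores)
    return p / p.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_classes = 10  # matches the ten-utterance task
    # Hypothetical single-layer nets for each modality (random weights).
    acoustic_net = [(rng.normal(size=(5, 16, n_classes)) * 0.1,
                     np.zeros(n_classes))]
    visual_net = [(rng.normal(size=(5, 8, n_classes)) * 0.1,
                   np.zeros(n_classes))]
    acoustic_frames = rng.normal(size=(40, 16))  # e.g. spectral features
    visual_frames = rng.normal(size=(40, 8))     # e.g. lip-contour features
    p = fuse(class_scores(acoustic_frames, acoustic_net),
             class_scores(visual_frames, visual_net))
    print("fused class posteriors:", np.round(p, 3), "-> class", p.argmax())

Because the fusion has no tunable weights, a modality that becomes uninformative (e.g. acoustics under pink noise) yields a near-uniform probability vector and simply stops influencing the product, which is one way such a system can degrade gracefully when one channel is corrupted.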