[Proceedings 1992] IJCNN International Joint Conference on Neural Networks
DOI: 10.1109/ijcnn.1992.226994

Neural network lipreading system for improved speech recognition

Abstract: We designed and trained a modified time-delay neural network (TDNN) to perform automatic lipreading ("speech reading") in conjunction with acoustic speech recognition in order to improve recognition both in silent environments and in the presence of acoustic noise. The speech reader subsystem has a speaker-independent recognition accuracy of 51% (in the absence of acoustic information); the combined acoustic-visual system has a recognition accuracy of 91%, all on a ten-utterance speaker-independen…

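To make the architecture described in the abstract concrete: a time-delay network applies the same weights to short, shifted windows of the input frame sequence, which in modern toolkits is naturally written as a 1-D convolution over time. The sketch below shows one possible TDNN-style audio-visual word classifier; the feature dimensions, layer sizes, word count, and mean-over-time fusion are illustrative assumptions, not the configuration reported in the paper.

```python
# A minimal sketch of a TDNN-style audio-visual word classifier (PyTorch).
# Feature dimensions, layer sizes, and the mean-over-time fusion are
# illustrative assumptions, not the architecture reported in the paper.
import torch
import torch.nn as nn

class AVTDNN(nn.Module):
    def __init__(self, n_audio_feats=16, n_visual_feats=5, n_words=10):
        super().__init__()
        # Time-delay layers are 1-D convolutions over the frame axis:
        # each unit sees a short window of neighbouring frames.
        self.audio_net = nn.Sequential(
            nn.Conv1d(n_audio_feats, 32, kernel_size=3), nn.Sigmoid(),
            nn.Conv1d(32, 16, kernel_size=5), nn.Sigmoid(),
        )
        self.visual_net = nn.Sequential(
            nn.Conv1d(n_visual_feats, 16, kernel_size=3), nn.Sigmoid(),
            nn.Conv1d(16, 16, kernel_size=5), nn.Sigmoid(),
        )
        self.classifier = nn.Linear(16 + 16, n_words)

    def forward(self, audio, visual):
        # audio: (batch, n_audio_feats, T_audio); visual: (batch, n_visual_feats, T_visual)
        a = self.audio_net(audio).mean(dim=2)    # integrate activations over time
        v = self.visual_net(visual).mean(dim=2)
        return self.classifier(torch.cat([a, v], dim=1))  # one score per word

# Example forward pass with random frame sequences of different lengths.
model = AVTDNN()
scores = model(torch.randn(2, 16, 40), torch.randn(2, 5, 25))
print(scores.shape)  # torch.Size([2, 10])
```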
Cited by 74 publications (33 citation statements)
References 21 publications
“…Moreover, by using a method common to both the audio and visual aspects of speech, there is the potential for a more straightforward combination of results obtained from separate audio and visual investigations and such integration has often been carried out using machine learning techniques, such as time delay neural network (TDNN) [42], support vector machines (SVM) [43] and AdaBoost [44].…”
Section: Speech Classification Based On Lip Features
confidence: 99%
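To illustrate the kind of integration the statement above refers to, the sketch below fuses per-utterance acoustic and visual feature vectors by simple concatenation and classifies them with a support vector machine (scikit-learn). The feature dimensions, the random stand-in data, and the ten-word vocabulary are invented for illustration and are not taken from the cited systems.

```python
# A minimal sketch of feature-level audio-visual integration with an SVM.
# Feature dimensions, stand-in data, and vocabulary size are invented.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_utterances = 100
audio_feats = rng.normal(size=(n_utterances, 12))   # stand-in acoustic features per utterance
visual_feats = rng.normal(size=(n_utterances, 6))   # stand-in lip-shape features per utterance
labels = rng.integers(0, 10, size=n_utterances)     # ten-word vocabulary

# Feature-level fusion: concatenate the two streams, then train one classifier.
fused = np.hstack([audio_feats, visual_feats])
clf = SVC(kernel="rbf").fit(fused, labels)
print(clf.predict(fused[:5]))
```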
“…In fixed lexicon systems, conditional independence is usually assumed at the word level, with very good results (Stork et al., 1992; Bregler et al., 1993b; Adjondani & Benoit, 1995; Movellan, 1995). While our approach does not require the assumption of conditional independence, it greatly simplifies the computations.…”
Section: Competitive Models and Robustification
confidence: 99%
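The word-level conditional-independence assumption mentioned above lets the acoustic and visual streams be scored separately and then multiplied: P(word | audio, visual) ∝ P(audio | word) · P(visual | word) · P(word). A toy numeric example, with made-up probabilities, follows.

```python
# Toy numeric example of word-level fusion under conditional independence:
# P(word | audio, visual) ∝ P(audio | word) * P(visual | word) * P(word).
# All probabilities below are made up for illustration.
import numpy as np

words = ["one", "two", "three"]
p_audio_given_word = np.array([0.50, 0.30, 0.20])    # acoustic likelihoods
p_visual_given_word = np.array([0.20, 0.60, 0.20])   # visual (lip) likelihoods
prior = np.array([1 / 3, 1 / 3, 1 / 3])              # uniform word prior

posterior = p_audio_given_word * p_visual_given_word * prior
posterior /= posterior.sum()
print(dict(zip(words, posterior.round(3))))
# {'one': 0.312, 'two': 0.562, 'three': 0.125}: "two" wins once both streams are combined
```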
“…Recent years have seen a dramatic flourishing of the engineering literature on AVSR (Yuhas et al., 1990; Wu et al., 1991; Stork et al., 1992; Bregler et al., 1993b; Cosi et al., 1994; Bregler et al., 1994; Wolff et al., 1994; Hennecke et al., 1994; de Sa, 1994; Movellan, 1995). Current interest on AVSR is in part due to the popularization of digital multimedia tools, its potential application to automatic speech recognition in noisy environments (e.g., car telephony, airplane cockpits, noisy offices), and its links to fundamental theoretical issues in engineering and in cognitive science (Movellan & Chadderdon, 1996).…”
Section: Audio Visual Speech Recognition
confidence: 99%
“…For example, in [6] a time-delayed neural network (TDNN) is applied in an automatic lipreading system to fuse audio and visual data. In [11], another TDNN is applied to visual and audio data to detect when and where a person is speaking in a scene. A major drawback of these networks is the problem of catastrophic forgetting; i.e., learned associations from input data to output classes could be adversely influenced if the network is trained online.…”
Section: Related Work
confidence: 99%
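The catastrophic-forgetting drawback noted above can be seen even in a tiny feed-forward network trained sequentially on two tasks: after further online training on the second task, accuracy on the first typically collapses. The sketch below uses made-up 2-D data and an ordinary PyTorch training loop purely to illustrate the effect; it has nothing to do with the cited audio-visual systems.

```python
# Toy illustration of catastrophic forgetting under sequential (online) training.
# Synthetic 2-D blobs; the effect, not the numbers, is the point.
import torch
import torch.nn as nn

torch.manual_seed(0)

def blob(center, label, n=200):
    x = torch.randn(n, 2) * 0.3 + torch.tensor(center)
    y = torch.full((n,), label)
    return x, y

# Task A separates classes along the x-axis; task B along the y-axis.
xa0, ya0 = blob([-1.0, 0.0], 0); xa1, ya1 = blob([1.0, 0.0], 1)
xb0, yb0 = blob([0.0, -1.0], 0); xb1, yb1 = blob([0.0, 1.0], 1)
x_a, y_a = torch.cat([xa0, xa1]), torch.cat([ya0, ya1])
x_b, y_b = torch.cat([xb0, xb1]), torch.cat([yb0, yb1])

net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train(x, y, steps=300):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(net(x), y).backward()
        opt.step()

def acc(x, y):
    return (net(x).argmax(dim=1) == y).float().mean().item()

train(x_a, y_a)
print("task A accuracy after training on A:", acc(x_a, y_a))
train(x_b, y_b)   # further online training on task B only
print("task A accuracy after training on B:", acc(x_a, y_a))  # typically drops toward chance
```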