Visual speech information from the speaker's mouth region has been shown to improve the noise robustness of automatic speech recognizers, promising to extend their usability in the human-computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audio-visual speech integration. On the latter topic, we discuss new work on combining feature and decision fusion, on modeling audio-visual speech asynchrony, and on incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audio-visual speaker adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, though less so in visually challenging environments and on large-vocabulary tasks.
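As a rough illustration of the kind of bimodal integration described above, the sketch below is not the authors' implementation: the function names, the DCT-based transform cascade, and the reliability weight are all illustrative assumptions. It extracts visual features from a mouth region of interest via a cascade of linear transforms (here, a truncated 2-D DCT) and combines per-class audio and visual log-likelihoods in a simple reliability-weighted decision-fusion step.

```python
import numpy as np
from scipy.fftpack import dct

def visual_front_end(roi, k=32):
    """Illustrative cascade of linear image transforms on a mouth ROI:
    a separable 2-D DCT followed by truncation to k coefficients."""
    coeffs = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    return coeffs.flatten()[:k]

def decision_fusion(audio_loglik, visual_loglik, lambda_a=0.7):
    """Weight the per-class stream log-likelihoods by an assumed audio
    reliability weight lambda_a; the visual stream gets 1 - lambda_a."""
    return lambda_a * audio_loglik + (1.0 - lambda_a) * visual_loglik

# Toy usage: one frame's mouth ROI and hypothetical per-word scores.
roi = np.random.rand(32, 32)                    # stand-in grayscale mouth ROI
v_feat = visual_front_end(roi)                  # visual feature vector
audio_ll = np.log(np.array([0.6, 0.3, 0.1]))    # hypothetical audio scores
visual_ll = np.log(np.array([0.4, 0.4, 0.2]))   # hypothetical visual scores
fused = decision_fusion(audio_ll, visual_ll)
print("fused scores:", fused, "-> best class:", int(np.argmax(fused)))
```

In a noisier acoustic environment one would lower lambda_a, shifting trust toward the visual stream; estimating that weight automatically is the modality-reliability problem the abstract refers to.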
Compelling public interest is propelling national efforts to advance the evidence base for cancer treatment and control measures and to transform the way in which evidence is aggregated and applied. Substantial investments in health information technology, comparative effectiveness research, health care quality and value, and personalized medicine support these efforts and have resulted in considerable progress to date. An emerging initiative, and one that integrates these converging approaches to improving health care, is "rapid-learning health care." In this framework, routinely collected real-time clinical data drive the process of scientific discovery, which becomes a natural outgrowth of patient care. To better understand the state of the rapid-learning health care model and its potential implications for oncology, the National Cancer Policy Forum of the Institute of Medicine held a workshop entitled "A Foundation for Evidence-Driven Practice: A Rapid-Learning System for Cancer Care" in October 2009. Participants examined the elements of a rapid-learning system for cancer, including registries and databases, emerging information technology, patient-centered and -driven clinical decision support, patient engagement, culture change, clinical practice guidelines, point-of-care needs in clinical oncology, and federal policy issues and implications. This Special Article reviews the activities of the workshop and sets the stage to move from vision to action.
We present a learning-based approach to the semantic indexing of multimedia content using cues derived from audio, visual, and text features. We approach the problem by developing a set of statistical models for a predefined lexicon. Novel concepts are then mapped in terms of the concepts in the lexicon. To achieve robust detection of concepts, we exploit features from multiple modalities, namely audio, video, and text. Concept representations are modeled using Gaussian mixture models (GMM), hidden Markov models (HMM), and support vector machines (SVM). Models such as Bayesian networks and SVMs are used in a late-fusion approach to model concepts that are not explicitly modeled in terms of features. Our experiments indicate the promise of the proposed classification and fusion methodologies: the fusion scheme achieves more than a 10% relative improvement over the best unimodal concept detector.
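A minimal sketch of the late-fusion idea, assuming per-modality concept detectors already produce confidence scores: the detector scores and labels below are synthetic, and scikit-learn's SVC stands in for whichever SVM implementation the work actually used. The combiner is trained on the concatenated unimodal scores so it learns how much to trust each modality.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic per-modality detector scores for one concept:
# columns = [audio score, visual score, text score], one row per shot.
n = 200
labels = rng.integers(0, 2, size=n)                  # 1 = concept present
scores = labels[:, None] * 0.4 + rng.random((n, 3))  # noisy unimodal scores

# Late fusion: an SVM over the stacked unimodal scores.
fusion_svm = SVC(kernel='rbf', probability=True).fit(scores[:150], labels[:150])
fused_probs = fusion_svm.predict_proba(scores[150:])[:, 1]
accuracy = np.mean((fused_probs > 0.5) == labels[150:])
print(f"fused detector accuracy on held-out shots: {accuracy:.2f}")
```

Because the fusion model sees only scores, not raw features, the same combiner can sit on top of GMM-, HMM-, or SVM-based unimodal detectors interchangeably.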
An application of neural network modeling is described for generating hypotheses about the relationships between the response properties of neurons and information processing in the auditory system. The goal is to study response properties that are useful for extracting sound-localization information from the directionally selective spectral filtering provided by the pinna. To this end, a feedforward neural network model with a guaranteed level of fault tolerance is introduced. Fault tolerance and uniform fault tolerance in a neural network are formally defined, and a method is described to ensure that the estimated network exhibits fault tolerance. The problem of estimating weights for such a network is formulated as a large-scale nonlinear optimization problem. Numerical experiments indicate that solutions with uniform fault tolerance exist for the pattern-recognition problem considered, and that solutions derived by introducing fault-tolerance constraints have better generalization properties than solutions obtained via unconstrained back-propagation.
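One simple way to operationalize the fault-tolerance idea is sketched below. This is not the paper's constrained nonlinear-optimization formulation: here the constraint is merely approximated by averaging the training loss over every single-hidden-unit ablation, so the resulting network tolerates any one stuck-at-zero hidden-unit fault. The toy XOR task, network size, and learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, W1, W2, mask):
    """Forward pass with a hidden-unit mask; mask[j] = 0 simulates
    a stuck-at-zero fault in hidden unit j."""
    H = sigmoid(X @ W1) * mask          # masked hidden activations
    return H, H @ W2                    # hidden layer, linear output

# Toy pattern-recognition task: XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [0]], float)

n_hidden = 8
W1 = rng.normal(0, 1, (2, n_hidden))
W2 = rng.normal(0, 1, (n_hidden, 1))
lr = 0.5

# Fault patterns: the fault-free network plus every single-unit ablation,
# so the averaged loss pushes toward uniform tolerance of any one failure.
masks = [np.ones(n_hidden)]
masks += [(np.arange(n_hidden) != j).astype(float) for j in range(n_hidden)]

for step in range(5000):
    g1 = np.zeros_like(W1)
    g2 = np.zeros_like(W2)
    for m in masks:
        H, out = forward(X, W1, W2, m)
        err = out - y                                  # MSE output gradient
        g2 += H.T @ err
        g1 += X.T @ ((err @ W2.T) * H * (1 - H))       # masked units give 0
    W1 -= lr * g1 / len(masks)
    W2 -= lr * g2 / len(masks)

# Check: classification must survive any single hidden-unit failure.
for j, m in enumerate(masks):
    _, out = forward(X, W1, W2, m)
    ok = np.all((out > 0.5) == (y > 0.5))
    print(f"fault pattern {j}: {'correct' if ok else 'WRONG'}")
```

Training against the fault patterns acts as a regularizer that spreads the computation across hidden units rather than concentrating it in a few, which is consistent with the reported improvement in generalization over unconstrained back-propagation.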