Ashutosh Garg scite author profile

Recent advances in the automatic recognition of audiovisual speech

Visual speech information from the speaker's mouth region has been successfully shown to improve the noise robustness of automatic speech recognizers, thus promising to extend their usability in the human-computer interface. In this paper, we review the main components of audio-visual automatic speech recognition and present novel contributions in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and second, audio-visual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audio-visual speech asynchrony, and incorporating modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audio-visual speaker adaptation. We apply our algorithms to three multi-subject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves automatic speech recognition over all conditions and data considered, although less so for visually challenging environments and large-vocabulary tasks.

Facial expression recognition from video sequences: temporal and static modeling

Cohen, Sebe, Garg, et al. 2003

Computer Vision and Image Understanding

757

404

Layered representations for learning and inferring office activity from multiple sensory channels

Oliver, Garg, Horvitz 2004

Computer Vision and Image Understanding

265

170

We present the use of layered probabilistic representations for modeling human activities, and describe how we use the representation to do sensing, learning, and inference at multiple levels of temporal granularity and abstraction and from heterogeneous data sources. The approach centers on the use of a cascade of Hidden Markov Models named Layered Hidden Markov Models (LHMMs) to diagnose states of a user's activity based on real-time streams of evidence from video, audio, and computer (keyboard and mouse) interactions. We couple these LHMMs with an expected utility analysis that considers the cost of misclassification. We describe the representation, present an implementation, and report on experiments with our layered architecture in a real-time office-awareness setting.

Layered representations for human activity recognition

Google news personalization

et al. 2007


Ashutosh Garg

Recent advances in the automatic recognition of audiovisual speech

Facial expression recognition from video sequences: temporal and static modeling

Layered representations for learning and inferring office activity from multiple sensory channels

Layered representations for human activity recognition

Google news personalization
