Visual information from a speaker's mouth region is known to improve the robustness of automatic speech recognition. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face, which is not always the case in realistic human-computer interaction (HCI) scenarios. One such case of interest is HCI inside smart rooms equipped with pan-tilt-zoom (PTZ) cameras that closely track the subject's head. However, since these cameras are fixed in space, they cannot always obtain frontal views of the speaker. Clearly, AVASR from non-frontal views is required, as well as fusion of multiple camera views, where available. In this paper, we report our preliminary work on this subject. In particular, we concentrate on two topics: first, the design of an AVASR system that operates on profile face views and its comparison with a traditional frontal-view AVASR system; and second, the fusion of the two systems into a multi-view frontal/profile system. We describe in particular our visual front-end approach for the profile-view system, and report experiments on a multi-subject, small-vocabulary, bimodal, multi-sensory database that contains synchronously captured audio with frontal and profile face video, recorded inside the IBM smart room as part of the CHIL project. Our experiments demonstrate that AVASR is possible from profile views, although the benefit of the visual modality is reduced compared to frontal video data.