2006 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems 2006
DOI: 10.1109/mfi.2006.265643

Audio-Visual ASR from Multiple Views inside Smart Rooms

Abstract: Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness. However, the vast majority of audio-visual automatic speech recognition (AVASR) studies assume frontal images of the speaker's face, which is not always the case in realistic human-computer interaction (HCI) scenarios. One such case of interest is HCI inside smart rooms, equipped with pan-tilt-zoom (PTZ) cameras that closely track the subject's head. Since however these cameras are fixed in space, they …

Cited by 5 publications (3 citation statements)
References: 12 publications
“…Multimodal sensing of human behavior has gained significant engineering interest in recent years, raising research challenges in fields such as signal processing [38], computer vision [39], robotics [40], speech recognition [41], and mobile sensing [42]. Since human behavior observations are desired in a variety of settings, from constrained structured ones to unconstrained unstructured environments, a wide range of acquisition approaches have been proposed to suit the specific application needs.…”
Section: Aspects Of Behavioral Signal Processing (mentioning)
confidence: 99%
“…There are several databases which have been developed for AVASR, such as IBM smart-room database [8], CUAVE database [9]. Unfortunately, most of these databases are captured in ideal video conditions.…”
Section: Experimental Data (mentioning)
confidence: 99%
“…Such location information can be further utilized in support of numerous audio-visual perception technologies: For example, 2D face information is useful for person identification [6], whereas 3D location coordinates can be employed in acoustic beamforming for far-field automatic speech recognition [7], as well as to obtain close-up presenter views based on steerable pan-tilt-zoom cameras [8,9] or camera selection schemes [10]. The views can further assist identification [11] and audio-visual speech technologies [12], among others, with obvious utility in lecture indexing and understanding of the interaction.…”
Section: Introduction (mentioning)
confidence: 99%