Audio-Visual ASR from Multiple Views inside Smart Rooms

Potamianos, G.; Lucey, Patrick

doi:10.1109/mfi.2006.265643

Cited by 5 publications

(3 citation statements)

References 12 publications

“…Multimodal sensing of human behavior has gained significant engineering interest in recent years, raising research challenges in fields such as signal processing [38], computer vision [39], robotics [40], speech recognition [41], and mobile sensing [42]. Since human behavior observations are desired in a variety of settings, from constrained structured ones to unconstrained unstructured environments, a wide range of acquisition approaches have been proposed to suit the specific application needs.…”

Section: Aspects Of Behavioral Signal Processing (mentioning)

confidence: 99%

Behavioral Signal Processing: Deriving Human Behavioral Informatics From Speech and Language

2013

The expression and experience of human behavior are complex and multimodal and characterized by individual and contextual heterogeneity and variability. Speech and spoken language communication cues offer an important means for measuring and modeling human behavior. Observational research and practice across a variety of domains from commerce to healthcare rely on speech- and language-based informatics for crucial assessment and diagnostic information and for planning and tracking response to an intervention. In this paper, we describe some of the opportunities as well as emerging methodologies and applications of human behavioral signal processing (BSP) technology and algorithms for quantitatively understanding and modeling typical, atypical, and distressed human behavior with a specific focus on speech- and language-based communicative, affective, and social behavior. We describe the three important BSP components of acquiring behavioral data in an ecologically valid manner across laboratory to real-world settings, extracting and analyzing behavioral cues from measured data, and developing models offering predictive and decision-making support. We highlight both the foundational speech and language processing building blocks as well as the novel processing and modeling opportunities. Using examples drawn from specific real-world applications ranging from literacy assessment and autism diagnostics to psychotherapy for addiction and marital well being, we illustrate behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion.

“…There are several databases which have been developed for AVASR, such as IBM smart-room database [8], CUAVE database [9]. Unfortunately, most of these databases are captured in ideal video conditions.…”

Section: Experimental Data (mentioning)

confidence: 99%

Lip detection for audio-visual speech recognition in-car environment

Navarathna, Lucey, Fookes, et al. 2010

10th International Conference on Information Science, Signal Processing and Their Applications (ISSPA 2010)

Self Cite

Acoustically, car cabins are extremely noisy and as a consequence audio-only, in-car voice recognition systems perform poorly. As the visual modality is immune to acoustic noise, using the visual lip information from the driver is seen as a viable strategy in circumventing this problem by using audio-visual automatic speech recognition (AVASR). However, implementing AVASR requires a system being able to accurately locate and track the driver's face and lip area in real-time. In this paper we present such an approach using the Viola-Jones algorithm. Using the AVICAR [1] in-car database, we show that the Viola-Jones approach is a suitable method of locating and tracking the driver's lips despite the visual variability of illumination and head pose for an audio-visual speech recognition system.

“…Such location information can be further utilized in support of numerous audio-visual perception technologies: For example, 2D face information is useful for person identification [6], whereas 3D location coordinates can be employed in acoustic beamforming for far-field automatic speech recognition [7], as well as to obtain close-up presenter views based on steerable pan-tilt-zoom cameras [8,9] or camera selection schemes [10]. The views can further assist identification [11] and audio-visual speech technologies [12], among others, with obvious utility in lecture indexing and understanding of the interaction.…”

Section: Introduction (mentioning)

confidence: 99%

Joint face and head tracking inside multi-camera smart rooms

Zhang, Potamianos, Senior, et al. 2007

SIViP

Self Cite

The paper introduces a novel detection and tracking system that provides both frame-view and world-coordinate human location information, based on video from multiple synchronized and calibrated cameras with overlapping fields of view. The system is developed and evaluated for the specific scenario of a seminar lecturer presenting in front of an audience inside a "smart room", its aim being to track the lecturer's head centroid in the three-dimensional (3D) space and also yield two-dimensional (2D) face information in the available camera views. The proposed approach is primarily based on a statistical appearance model of human faces by means of well-known AdaBoost-like face detectors, extended to address the head pose variation observed in the smart room scenario of interest. The appearance module is complemented by two novel components and assisted by a simple tracking drift detection mechanism. The first component of interest is the initialization module, which employs a spatio-temporal dynamic programming approach with appropriate penalty functions to obtain optimal 3D location hypotheses. The second is an adaptive subspace learning based 2D tracking scheme with a novel forgetting mechanism, introduced as a means to reduce tracking drift and increase robustness to illumination and head pose variation. System performance is benchmarked on an extensive database of realistic human interaction in the lecture smart room scenario, collected as part of the European integrated project "CHIL". The system consistently achieves excellent tracking precision, with a 3D mean tracking error of less than 16 cm, and is demonstrated to outperform four alternative tracking schemes. Furthermore, the proposed system performs relatively well in detecting frontal and near-frontal faces in the available frame views.
