Real-time lip tracking for audio-visual speech recognition applications

Kaucic, R.; Dalton, Barney; Blake, A.

doi:10.1007/3-540-61123-1_154

Cited by 54 publications

(30 citation statements)

References 19 publications


“…For example, a human face is composed of outer face contour, eyebrows, eyes, nose, and mouth. Analyzing the motion of structured deformable shapes has many real applications such as tracking human lips for speech recognition [1], locating human faces for face recognition [2], and medical applications such as tracking the endocardial wall [3]. The structured deformation is different from articulated motion.…”

Section: Introduction (mentioning)

confidence: 99%

Sequential mean field variational analysis of structured deformable shapes

Hua

Wu

2006

Computer Vision and Image Understanding


“…In audio-visual speech recognition [17,19], visual features obtained by tracking the movement of lips and mouths are combined with audio features for improved speech recognition. In audio-visual object detection and tracking [3,8], synchronized visual foreground objects and audio background sounds are used for object detection [8].…”

Section: Introduction (mentioning)

confidence: 99%

Short-term audio-visual atoms for generic video concept classification

Jiang

Cotton

Chang

et al. 2009

Proceedings of the 17th ACM International Conference on Multimedia


We investigate the challenging issue of joint audio-visual analysis of generic videos targeting semantic concept detection. We propose to extract a novel representation, the Short-term Audio-Visual Atom (S-AVA), for improved concept detection. An S-AVA is defined as a short-term region track associated with regional visual features and background audio features. An effective algorithm, named Short-Term Region tracking with joint Point Tracking and Region Segmentation (STR-PTRS), is developed to extract S-AVAs from generic videos under challenging conditions such as uneven lighting, clutter, occlusions, and complicated motions of both objects and camera. Discriminative audio-visual codebooks are constructed on top of S-AVAs using Multiple Instance Learning. Codebook-based features are generated for semantic concept detection. We extensively evaluate our algorithm over Kodak's consumer benchmark video set from real users. Experimental results confirm significant performance improvements: over 120% MAP gain compared to alternative approaches using static region segmentation without temporal tracking. The joint audio-visual features also outperform visual features alone by an average of 8.5% (in terms of AP) over 21 concepts, with many concepts achieving more than 20%.

General Terms: Algorithms, Experimentation

Keywords: Semantic concept detection, joint audio-visual analysis, short-term audio-visual atom, audio-visual codebook

MM '09, October 19-24, 2009, Beijing, China.

[…] track of a birthday cake and the background birthday music form a salient audio-visual cue to describe "birthday" videos. The short-term region track of a horse and the background horse running footstep sound give a salient audio-visual cue for the "animal" concept.


“…A method based on B-splines and Kalman filters has been described in [12]. A stochastic dynamic model is learned from example sequences which enhances the tracking speed and robustness to distractions.…”

Section: Related Work (mentioning)

confidence: 99%

Locating and tracking facial speech features

Luettin

Thacker

Beet

1996

Proceedings of 13th International Conference on Pattern Recognition


This paper describes a robust method for extracting visual speech information from the shape of lips to be used for an automatic speechreading (lipreading) system. Lip deformation is modelled by a statistically based deformable contour model which learns typical lip deformation from a training set. The main difficulty in locating and tracking lips consists of finding dominant image features for representing the lip contours. We describe the use of a statistical profile model which learns dominant image features from a training set. The model captures global intensity variation due to different illumination and different skin reflectance as well as intensity changes at the inner lip contour due to mouth opening and visibility of teeth and tongue. The method is validated for locating and tracking lip movements on a database of a broad variety of speakers.

