Discriminative Analysis of Lip Motion Features for Speaker Identification and Speech-Reading

Çetingül, H. Ertan; Yemez, Y.; Erzin, Engin; Tekalp, A.M.

doi:10.1109/tip.2006.877528

Cited by 101 publications

(37 citation statements)

References 37 publications


“…This block is identical to that of an audio-only ASR system and the features most commonly used are perceptual linear predictive [16] or Mel frequency cepstral coefficients [17,18]. In parallel, the face of the speaker has to be localized from the video sequence and the region of the mouth detected and normalized before relevant features can be extracted [1,19]. Typically, both audio and visual features are extended to include some temporal information of the speech process.…”

Section: Audio-visual Speech Recognition (mentioning)

confidence: 99%

Multi-pose lipreading and audio-visual speech recognition

Estellers, Thiran (2012), EURASIP J. Adv. Signal Process.


In this article, we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on overcoming the effects of a changing pose of the speaker, a problem encountered in natural situations where the speaker moves freely and does not keep a frontal pose with relation to the camera. To handle these situations, we introduce a pose normalization block in a standard system and generate virtual frontal views from non-frontal images. The proposed method is inspired by pose-invariant face recognition and relies on linear regression to find an approximate mapping between images from different poses. We integrate the proposed pose normalization block at different stages of the speech recognition system and quantify the loss of performance related to pose changes and pose normalization techniques. In audio-visual experiments we also analyze the integration of the audio and visual streams. We show that an audio-visual system should account for non-frontal poses and normalization techniques in terms of the weight assigned to the visual stream in the classifier.
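The abstract above describes pose normalization as a linear regression that maps non-frontal images to approximate frontal views. A minimal sketch of that idea on synthetic feature vectors follows. The function name `fit_pose_mapping`, the small ridge term, the absence of a bias term, and the random data are all assumptions made for illustration and are not taken from the cited article.

```python
import numpy as np

def fit_pose_mapping(nonfrontal, frontal, ridge=1e-3):
    """Return W such that frontal ≈ nonfrontal @ W in the least-squares sense.

    The ridge term is an assumption added for numerical stability.
    """
    X = np.asarray(nonfrontal, dtype=float)
    Y = np.asarray(frontal, dtype=float)
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

# Synthetic paired training data: each row is one flattened mouth-region
# feature vector, seen from the side and from the front (hypothetical data).
rng = np.random.default_rng(0)
true_map = rng.normal(size=(16, 16))
side_train = rng.normal(size=(200, 16))
frontal_train = side_train @ true_map + 0.01 * rng.normal(size=(200, 16))

W = fit_pose_mapping(side_train, frontal_train)

# Generate "virtual frontal" features for unseen side-view samples.
side_test = rng.normal(size=(5, 16))
virtual_frontal = side_test @ W
```

In a real system the paired rows would come from images of the same utterance captured at two poses; here the linear relationship is built into the synthetic data, so the fit recovers it almost exactly.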


“…From this short literature review, we can conclude that the pixel based feature extraction techniques [1,3,5,14,17,20] are in general better fitted to encode the lips dynamics in a compact representation than the contour-based feature extraction methods [8,12,15]. Based on this conclusion, we formulated the visual speech recognition as the process of recognizing individual words based on a new manifold representation.…”

Section: Introduction (mentioning)

confidence: 95%

A PCA based manifold representation for visual speech recognition

Yu, Ghita, Sutherland et al. (2007), China-Ireland International Conference on Information and Communications Technologies (CIICT 2007)


In this paper, we discuss a new Principal Component Analysis (PCA)-based manifold representation for visual speech recognition. In this regard, the real time input video data is compressed using Principal Component Analysis and the low-dimensional points calculated for each frame define the manifold. Since the number of frames that form the video sequence is dependent on the word complexity, in order to use these manifolds for visual speech classification it is required to re-sample them into a fixed pre-defined number of key-points. These key-points are used as input for a Hidden Markov Model (HMM) classification scheme. We have applied the developed visual speech recognition system to a database containing a group of English words and the experimental data indicates that the proposed approach is able to produce accurate classification results.
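The pipeline in the abstract above (PCA compression of each frame, then re-sampling the resulting trajectory to a fixed number of key-points) can be sketched roughly as follows. This is an illustration under stated assumptions: the abstract does not say how the re-sampling is done, so linear interpolation is used here as a stand-in, and the function names, the three components, the ten key-points, and the random frames are all hypothetical. The HMM classification stage is omitted.

```python
import numpy as np

def fit_pca(frames, n_components=3):
    """Learn a PCA basis from flattened frames (one frame per row)."""
    X = np.asarray(frames, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(frames, mean, components):
    """Map each frame to a low-dimensional point; the sequence forms the manifold."""
    return (np.asarray(frames, dtype=float) - mean) @ components.T

def resample_trajectory(points, n_keypoints=10):
    """Linearly interpolate a T x K trajectory to n_keypoints x K (assumed scheme)."""
    points = np.asarray(points, dtype=float)
    t_old = np.linspace(0.0, 1.0, len(points))
    t_new = np.linspace(0.0, 1.0, n_keypoints)
    return np.column_stack(
        [np.interp(t_new, t_old, points[:, k]) for k in range(points.shape[1])]
    )

# Two synthetic "words" of different length, each frame an 8x8 image flattened to 64.
rng = np.random.default_rng(0)
short_word = rng.random((12, 64))
long_word = rng.random((30, 64))

mean, components = fit_pca(np.vstack([short_word, long_word]))
traj_short = project(short_word, mean, components)
traj_long = project(long_word, mean, components)

# Both words now have the same fixed-size representation.
keys_short = resample_trajectory(traj_short)
keys_long = resample_trajectory(traj_long)
```

Re-sampling to a common length is what lets sequences with different frame counts share one classifier input size, which is the motivation the abstract gives for the key-points.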


“…The identity recognition based on lip movement as a biological characteristic among these enjoys a great potential since it is relatively simple in data collection and low in equipment cost. Lip feature information extraction is the most crucial step [1]. There are basically two ways of extraction, namely static approach and dynamic approach.…”

Section: Introduction (mentioning)

confidence: 99%