2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2016.7472029
Decoding visemes: Improving machine lip-reading

Abstract: To undertake machine lip-reading, we try to recognise speech from a visual signal. Current work often uses viseme classification supported by language models with varying degrees of success. A few recent works suggest phoneme classification, in the right circumstances, can outperform viseme classification. In this work we present a novel two-pass method of training phoneme classifiers which uses previously trained visemes in the first pass. With our new training algorithm, we show classification performance wh…
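The core of the two-pass idea in the abstract is that viseme classifiers trained in the first pass are reused to initialise phoneme classifiers in the second. The sketch below is a minimal, hypothetical rendering of that idea using hmmlearn Gaussian HMMs; the data layout, the phoneme-to-viseme map, and the warm-start details are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_hmm(sequences, n_states=3):
    """Fit one HMM on the labelled feature sequences of a single unit."""
    X = np.vstack(sequences)                  # stack (T_i x D) arrays
    lengths = [len(s) for s in sequences]     # sequence boundaries for EM
    model = GaussianHMM(n_components=n_states, covariance_type="full", n_iter=20)
    model.fit(X, lengths)
    return model

def train_two_pass(viseme_data, phoneme_data, phoneme_to_viseme, n_states=3):
    """viseme_data / phoneme_data: {label: [list of (T_i x D) feature arrays]}.
    phoneme_to_viseme: assumed many-to-one map, e.g. {"p": "V1", "b": "V1"}."""
    # Pass 1: train coarse viseme classifiers on visually distinct classes.
    viseme_hmms = {v: train_hmm(seqs, n_states) for v, seqs in viseme_data.items()}

    # Pass 2: warm-start each phoneme HMM from its parent viseme HMM,
    # then re-estimate it on the (sparser) phoneme-labelled data.
    phoneme_hmms = {}
    for ph, seqs in phoneme_data.items():
        parent = viseme_hmms[phoneme_to_viseme[ph]]
        model = GaussianHMM(n_components=n_states, covariance_type="full",
                            n_iter=20, init_params="")  # keep copied parameters
        model.startprob_ = parent.startprob_.copy()
        model.transmat_ = parent.transmat_.copy()
        model.means_ = parent.means_.copy()
        model.covars_ = parent.covars_.copy()
        model.fit(np.vstack(seqs), [len(s) for s in seqs])
        phoneme_hmms[ph] = model
    return phoneme_hmms
```

The warm start matters because phoneme-labelled visual data is split across many more classes than viseme-labelled data, so each phoneme model alone sees far fewer training frames.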

Cited by 27 publications (33 citation statements) | References 72 publications
“…However, in our case, lip-reading, which is useful for understanding speech when the audio is too noisy to recognize easily, means classifying speech from only the visual information channel of the speech signal; thus, as we shall present, we use a novel training method which uses new visual units and phonemes in a complementary fashion. This paper is an extended version of our prior work [5, 31]; this work is relevant to all classifiers, since the choice of visual unit matters and is made before the classifier is trained. In other words, the choice of visual units must be made early in the design process, and a non-optimal choice can be very expensive in terms of performance.…”
Section: Word Entry Phoneme Dictionary Viseme Dictionary
confidence: 99%
“…For this, they generate visemes and compare them to speech units such as words, syllables, or phonemes. As discussed in [4], there is no standard definition of a viseme; a range of definitions is possible, such as “a set of phonemes that have identical appearance on the lips” [5]. However, there are some limitations with this approach.…”
Section: Introduction
confidence: 99%
“…However, there are some limitations with this approach. There is no complete one-to-one mapping between phonemes and visemes, as one viseme can map to several phonemes [4], which makes classification challenging. Another issue with using visemes is co-articulation, where a speaker starts to form words before they are spoken, so that a phone is pronounced differently under the influence of adjacent phonemes; this has been identified as having a negative effect on lip-reading results [4].…”
Section: Introduction
confidence: 99%
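To make the many-to-one problem concrete, here is a toy phoneme-to-viseme map. The groupings (bilabials sharing one lip shape, labiodentals another) follow the commonly cited pattern, but any real mapping, such as Fisher's, differs in detail; the class names are invented for illustration.

```python
from collections import defaultdict

# Toy many-to-one phoneme-to-viseme map (labels V1..V3 are hypothetical).
PHONEME_TO_VISEME = {
    "p": "V1", "b": "V1", "m": "V1",             # bilabials: lips pressed together
    "f": "V2", "v": "V2",                        # labiodentals: teeth on lower lip
    "t": "V3", "d": "V3", "s": "V3", "z": "V3",  # alveolars look similar on the lips
}

# Inverting the map makes the ambiguity explicit: one viseme covers
# several phonemes, so phoneme identity is lost at the lips.
VISEME_TO_PHONEMES = defaultdict(list)
for ph, v in PHONEME_TO_VISEME.items():
    VISEME_TO_PHONEMES[v].append(ph)

print(VISEME_TO_PHONEMES["V1"])  # ['p', 'b', 'm'] -- three phonemes, one viseme
```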
“…Several studies have reported increased performance of multimodal systems operating in noise compared to uni-modal acoustic speech recognition systems (Chibelushi et al. 1996; Kashiwagi et al. 2012; Potamianos et al. 2003; Stewart et al. 2014). Well-established work in the field of Audio-Visual Speech Recognition (AVSR) employs parametrization of facial features using Active Appearance Models (AAM) (Nguyen and Milgram 2009) and viseme recognition using Hidden Markov Models (HMM) (Bear and Harvey 2016) or Dynamic Bayesian Networks (Jadczyk and Ziółko 2015). The most recent works employ Deep Neural Networks (DNN) (Almajai et al. 2016; Mroueh et al. 2015) and Convolutional Neural Networks (CNN) (Noda et al. 2015) as front-ends for audio and visual feature extraction.…”
Section: Introduction
confidence: 99%
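As a complement to the survey above, this is a minimal, assumed sketch of the classic AAM-plus-HMM recognition step it describes: one Gaussian HMM per viseme class over pre-extracted per-frame visual features (e.g. AAM parameters), with a test sequence assigned to the class whose model scores it highest. It illustrates the general approach, not any specific cited system.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_models(train_data, n_states=3):
    """train_data: {viseme_label: [list of (T_i x D) feature arrays]}."""
    models = {}
    for label, seqs in train_data.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, sequence):
    """Return the viseme label whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(sequence))
```

The DNN/CNN systems cited above replace the hand-crafted AAM front-end with learned features, but the per-class sequence-scoring structure of the back-end is often retained or hybridised.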