2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2016.7472029
Decoding visemes: Improving machine lip-reading

Abstract: To undertake machine lip-reading, we try to recognise speech from a visual signal. Current work often uses viseme classification supported by language models with varying degrees of success. A few recent works suggest phoneme classification, in the right circumstances, can outperform viseme classification. In this work we present a novel two-pass method of training phoneme classifiers which uses previously trained visemes in the first pass. With our new training algorithm, we show classification performance wh…
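The core of the two-pass idea in the abstract is that viseme classifiers trained in the first pass are reused to initialise phoneme classifiers in the second. The sketch below is a minimal, hypothetical rendering of that idea using hmmlearn Gaussian HMMs; the data layout, the phoneme-to-viseme map, and the warm-start details are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_hmm(sequences, n_states=3):
    """Fit one HMM on the labelled feature sequences of a single unit."""
    X = np.vstack(sequences)                  # stack (T_i x D) arrays
    lengths = [len(s) for s in sequences]     # sequence boundaries for EM
    model = GaussianHMM(n_components=n_states, covariance_type="full", n_iter=20)
    model.fit(X, lengths)
    return model

def train_two_pass(viseme_data, phoneme_data, phoneme_to_viseme, n_states=3):
    """viseme_data / phoneme_data: {label: [list of (T_i x D) feature arrays]}.
    phoneme_to_viseme: assumed many-to-one map, e.g. {"p": "V1", "b": "V1"}."""
    # Pass 1: train coarse viseme classifiers on visually distinct classes.
    viseme_hmms = {v: train_hmm(seqs, n_states) for v, seqs in viseme_data.items()}

    # Pass 2: warm-start each phoneme HMM from its parent viseme HMM,
    # then re-estimate it on the (sparser) phoneme-labelled data.
    phoneme_hmms = {}
    for ph, seqs in phoneme_data.items():
        parent = viseme_hmms[phoneme_to_viseme[ph]]
        model = GaussianHMM(n_components=n_states, covariance_type="full",
                            n_iter=20, init_params="")  # keep copied parameters
        model.startprob_ = parent.startprob_.copy()
        model.transmat_ = parent.transmat_.copy()
        model.means_ = parent.means_.copy()
        model.covars_ = parent.covars_.copy()
        model.fit(np.vstack(seqs), [len(s) for s in seqs])
        phoneme_hmms[ph] = model
    return phoneme_hmms
```

The warm start matters because phoneme-labelled visual data is split across many more classes than viseme-labelled data, so each phoneme model alone sees far fewer training frames.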

Cited by 27 publications (33 citation statements) | References 72 publications
“…However, in our case, lip-reading, which is useful for understanding speech when the audio is too noisy to recognize easily, means classifying speech from only the visual information channel of the speech signal; thus, as we shall present, we use a novel training method which uses new visual units and phonemes in a complementary fashion. This paper is an extended version of our prior work [5, 31]; this work is relevant to all classifiers, since the choice of visual unit matters and is made before the classifier is trained. In other words, the choice of visual units must be made early in the design process, and a non-optimal choice can be very expensive in terms of performance.…”
Section: Word Entry Phoneme Dictionary Viseme Dictionary
confidence: 99%
“…For this, they generate visemes and compare them to speech units such as words, syllables, or phonemes. As discussed in [4], there is no standard definition of a viseme; a range of definitions is possible, such as “a set of phonemes that have identical appearance on the lips” [5]. However, there are some limitations with this approach.…”
Section: Introduction
confidence: 99%
“…However, there are some limitations with this approach. There is no complete one-to-one mapping between phonemes and visemes, as one viseme can map to several phonemes [4], which makes classification challenging. Another issue with using visemes is co-articulation, where a speaker starts to form words before they are spoken, so that a phone is pronounced differently under the influence of adjacent phonemes; this has been identified as having a negative effect on lip-reading results [4].…”
Section: Introduction
confidence: 99%
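To make the many-to-one problem concrete, here is a toy phoneme-to-viseme map. The groupings (bilabials sharing one lip shape, labiodentals another) follow the commonly cited pattern, but any real mapping, such as Fisher's, differs in detail; the class names are invented for illustration.

```python
from collections import defaultdict

# Toy many-to-one phoneme-to-viseme map (labels V1..V3 are hypothetical).
PHONEME_TO_VISEME = {
    "p": "V1", "b": "V1", "m": "V1",             # bilabials: lips pressed together
    "f": "V2", "v": "V2",                        # labiodentals: teeth on lower lip
    "t": "V3", "d": "V3", "s": "V3", "z": "V3",  # alveolars look similar on the lips
}

# Inverting the map makes the ambiguity explicit: one viseme covers
# several phonemes, so phoneme identity is lost at the lips.
VISEME_TO_PHONEMES = defaultdict(list)
for ph, v in PHONEME_TO_VISEME.items():
    VISEME_TO_PHONEMES[v].append(ph)

print(VISEME_TO_PHONEMES["V1"])  # ['p', 'b', 'm'] -- three phonemes, one viseme
```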
“…Several studies have reported increased performance of multimodal systems operating in noise compared to uni-modal acoustic speech recognition systems (Chibelushi et al. 1996; Kashiwagi et al. 2012; Potamianos et al. 2003; Stewart et al. 2014). Well-established work in the field of Audio-Visual Speech Recognition (AVSR) employs parametrization of facial features using Active Appearance Models (AAM) (Nguyen and Milgram 2009) and viseme recognition using Hidden Markov Models (HMM) (Bear and Harvey 2016) or Dynamic Bayesian Networks (Jadczyk and Ziółko 2015). The most recent works employ Deep Neural Networks (DNN) (Almajai et al. 2016; Mroueh et al. 2015) and Convolutional Neural Networks (CNN) (Noda et al. 2015) as front-ends for audio and visual feature extraction.…”
Section: Introduction
confidence: 99%
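As a complement to the survey above, this is a minimal, assumed sketch of the classic AAM-plus-HMM recognition step it describes: one Gaussian HMM per viseme class over pre-extracted per-frame visual features (e.g. AAM parameters), with a test sequence assigned to the class whose model scores it highest. It illustrates the general approach, not any specific cited system.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_models(train_data, n_states=3):
    """train_data: {viseme_label: [list of (T_i x D) feature arrays]}."""
    models = {}
    for label, seqs in train_data.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, sequence):
    """Return the viseme label whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(sequence))
```

The DNN/CNN systems cited above replace the hand-crafted AAM front-end with learned features, but the per-class sequence-scoring structure of the back-end is often retained or hybridised.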