2014 IEEE International Conference on Image Processing (ICIP) 2014
DOI: 10.1109/icip.2014.7025274

Resolution limits on visual speech recognition

Abstract: Visual-only speech recognition depends upon a number of factors that can be difficult to control, such as lighting, identity, motion, emotion, and expression. But some factors, such as video resolution, are controllable, so it is surprising that there is not yet a systematic study of the effect of resolution on lip-reading. Here we use a new data set, the Rosetta Raven data, to train and test recognizers so we can measure the effect of video resolution on recognition accuracy. We conclude that, contrary to…
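The abstract describes varying video resolution and measuring its effect on recognition accuracy. A minimal sketch of how reduced spatial resolution might be simulated for such an experiment, by block-averaging a grayscale frame (the function name and factor are illustrative, not from the paper):

```python
import numpy as np

def downsample_frame(frame: np.ndarray, factor: int) -> np.ndarray:
    """Reduce spatial resolution by averaging non-overlapping
    factor x factor pixel blocks, a simple stand-in for a
    lower-resolution camera."""
    h, w = frame.shape
    h2, w2 = h - h % factor, w - w % factor  # crop to a multiple of factor
    blocks = frame[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor)
    return blocks.mean(axis=(1, 3))

# Example: a 120x160 grayscale mouth-region frame reduced 4x.
frame = np.random.rand(120, 160)
low_res = downsample_frame(frame, 4)
print(low_res.shape)  # (30, 40)
```

Recognizers trained on full-resolution features could then be tested on frames passed through this reduction at several factors.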

Cited by 18 publications (14 citation statements). References 11 publications.
“…For each frame a single feature vector is extracted which is the concatenation of the shape and appearance parameters. There are many examples of speaker-dependent AAMs improving MLR [2].…”
Section: A. Features
confidence: 99%
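The citation statement above describes the per-frame feature as the concatenation of AAM shape and appearance parameters. A minimal sketch of that construction, assuming the parameters have already been fitted per frame (the function names and dimensions are illustrative):

```python
import numpy as np

def aam_feature_vector(shape_params: np.ndarray,
                       appearance_params: np.ndarray) -> np.ndarray:
    """Per-frame feature: concatenation of AAM shape and appearance
    parameter vectors, as described in the citing paper."""
    return np.concatenate([shape_params, appearance_params])

def sequence_features(shape_seq, appearance_seq) -> np.ndarray:
    """Stack per-frame vectors into a (frames, dims) observation
    matrix of the kind typically fed to an HMM-based recognizer."""
    return np.stack([aam_feature_vector(s, a)
                     for s, a in zip(shape_seq, appearance_seq)])

# e.g. 10 shape + 20 appearance parameters over 75 frames
shapes = [np.zeros(10) for _ in range(75)]
apps = [np.zeros(20) for _ in range(75)]
print(sequence_features(shapes, apps).shape)  # (75, 30)
```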
“…Speaker appearance, or identity, is known to be important in the recognition of speech from visual-only information (lipreading) [33], more so than in auditory speech. Indeed, appearance data improves lipreading classification over shape-only models whether one uses Active Appearance Model (AAM) [28] or Discrete Cosine Transform (DCT) [10] features. In machine lipreading we have interesting evidence: we can both identify individuals from visual speech information [34,35] and, with deep learning and big data, we have the potential to generalise over many speakers [8,36].…”
Section: Speaker-specific Visemes
confidence: 99%
“…Video quality: Yes [14][15][16]; Yes [9]. Unit choice: Yes [17][18][19][20][21]; Yes [3,4,22,23,24]. Classifier technology: Yes [17,25,26,27,28]. Multiple persons: Yes [29,30,31,32]…”
Section: Video Quality
confidence: 99%