2017
DOI: 10.1007/s11042-017-5217-5

A comparative study of English viseme recognition methods and algorithms

Abstract: The paper considers an elementary visual unit, the viseme, in the context of constructing the feature vector that serves as the main visual input component of Audio-Visual Speech Recognition systems. The aim of the presented research is a review of various approaches to the problem, an implementation of the algorithms proposed in the literature, and a comparative study of their effectiveness. In the course of the study, an optimal feature vector construction and an appropriate choice of classifier were sought. The ex…
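The abstract describes a pipeline of feature-vector extraction followed by classification. As an illustrative sketch only: the specific features and classifier below (flattened, intensity-scaled lip-region crops plus k-nearest-neighbours voting) are assumptions for the example, not the paper's actual methods, which the truncated abstract does not name.

```python
import numpy as np

def extract_feature_vector(lip_frame):
    """Flatten an 8-bit lip-region crop into a unit-scaled feature vector."""
    return np.asarray(lip_frame, dtype=np.float64).ravel() / 255.0

def knn_classify(query, train_X, train_y, k=3):
    """Assign the majority viseme label among the k nearest training vectors."""
    dists = np.linalg.norm(train_X - query, axis=1)   # Euclidean distance to each sample
    nearest = np.argsort(dists)[:k]                   # indices of the k closest vectors
    labels = [train_y[i] for i in nearest]
    return max(set(labels), key=labels.count)         # majority vote
```

Any comparative study of this kind would swap in different feature constructions and classifiers behind the same two-stage interface.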

Cited by 18 publications (6 citation statements)
References 26 publications
“…That is because characters like 'p' and 'b' belong to the same viseme class. Similar is the case with expressions like "Elephant Juice" and "I love you" which though having similar visemic appearances definitely have very different sounds and meanings [26].…”
Section: Introduction (supporting)
confidence: 53%
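The statement above notes that 'p' and 'b' fall in the same viseme class, so they are indistinguishable from the visual channel alone. A minimal sketch of such a many-to-one phoneme-to-viseme mapping follows; the grouping (e.g. the bilabials /p/, /b/, /m/ sharing one viseme) reflects common lip-reading practice, but the exact class inventory is an assumption, not the cited paper's set.

```python
# Hypothetical viseme class labels; real systems use inventories
# of roughly 10-14 classes for English.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "s": "V_alveolar", "z": "V_alveolar",
    "k": "V_velar", "g": "V_velar",
}

def visually_confusable(phoneme_a, phoneme_b):
    """Two phonemes look alike on the lips if they share a viseme class."""
    va = PHONEME_TO_VISEME.get(phoneme_a)
    vb = PHONEME_TO_VISEME.get(phoneme_b)
    return va is not None and va == vb
```

Under this mapping, `visually_confusable("p", "b")` holds while `visually_confusable("p", "t")` does not, which is exactly the ambiguity the citation statement describes.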
“…A viseme is the visual equivalent of the phoneme: a static image of a person articulating a phoneme ( Dong et al, 2003 ). There are some phonemes that share identical visemes ( Cappelletta and Harte, 2012 , Lucey et al, 2004 , Mahavidyalaya, 2014 ), but for the vowels of the syllables used in this study, the visemes are clearly distinguishable (see illustrations in Jachimski et al, 2018 ), which is of importance given that we present visual-only trials as well. The syllables were edited using Audacity (version 3.0.2) in order to be cut and adjusted to the same duration of 400 ms.…”
Section: Methods (mentioning)
confidence: 84%
“…The specific data is shown in Table 1. The new British English dataset contains 11 visemes [27], 8 hand shapes and 4 hand positions to encode 17 vowels and 24 consonants. RGB video images of the interpreter's upper body are available at 25 fps, and the spatial resolution is 720 × 1280.…”
Section: Dataset (mentioning)
confidence: 99%
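The counts in the statement above admit a quick coding-capacity check. In standard Cued Speech, hand shapes code consonant groups and hand positions code vowel groups, with lip visemes disambiguating within each group; that division of labour is an assumption here, since the cited dataset description only states the counts.

```python
hand_shapes, hand_positions, lip_visemes = 8, 4, 11
consonants, vowels = 24, 17

consonants_per_shape = consonants / hand_shapes   # 3.0 consonants share each hand shape
vowels_per_position = vowels / hand_positions     # 4.25 vowels share each hand position
manual_cues = hand_shapes * hand_positions        # 32 distinct shape-position combinations
```

So the manual channel alone (32 cues) cannot uniquely encode all 41 phonemes; the 11 lip visemes supply the remaining distinctions within each hand-cue group.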
“…Ours(multi) represents our model in the multi-speaker scenario. [Extraction-garbled results table comparing +S3, Ours-SANs, Ours, and Ours(multi); the numeric columns cannot be reliably reconstructed.]…”
Section: (unspecified) (mentioning)
confidence: 99%