2017
DOI: 10.1007/s11042-017-5217-5

A comparative study of English viseme recognition methods and algorithms

Abstract: The paper considers an elementary visual unit, the viseme, in the context of constructing the feature vector that serves as the main visual input component of Audio-Visual Speech Recognition systems. The aim of the presented research is a review of various approaches to the problem, an implementation of the algorithms proposed in the literature, and a comparative study of their effectiveness. In the course of the study, an optimal feature vector construction and an appropriate choice of classifier were sought. The ex…
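The abstract describes a pipeline of feature-vector extraction followed by classification. As an illustrative sketch only: the specific features and classifier below (flattened, intensity-scaled lip-region crops plus k-nearest-neighbours voting) are assumptions for the example, not the paper's actual methods, which the truncated abstract does not name.

```python
import numpy as np

def extract_feature_vector(lip_frame):
    """Flatten an 8-bit lip-region crop into a unit-scaled feature vector."""
    return np.asarray(lip_frame, dtype=np.float64).ravel() / 255.0

def knn_classify(query, train_X, train_y, k=3):
    """Assign the majority viseme label among the k nearest training vectors."""
    dists = np.linalg.norm(train_X - query, axis=1)   # Euclidean distance to each sample
    nearest = np.argsort(dists)[:k]                   # indices of the k closest vectors
    labels = [train_y[i] for i in nearest]
    return max(set(labels), key=labels.count)         # majority vote
```

Any comparative study of this kind would swap in different feature constructions and classifiers behind the same two-stage interface.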

Cited by 18 publications (6 citation statements)
References 26 publications
“…That is because characters like 'p' and 'b' belong to the same viseme class. Similar is the case with expressions like "Elephant Juice" and "I love you" which though having similar visemic appearances definitely have very different sounds and meanings [26].…”
Section: Introduction (supporting)
confidence: 53%
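The statement above notes that 'p' and 'b' fall in the same viseme class, so they are indistinguishable from the visual channel alone. A minimal sketch of such a many-to-one phoneme-to-viseme mapping follows; the grouping (e.g. the bilabials /p/, /b/, /m/ sharing one viseme) reflects common lip-reading practice, but the exact class inventory is an assumption, not the cited paper's set.

```python
# Hypothetical viseme class labels; real systems use inventories
# of roughly 10-14 classes for English.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "s": "V_alveolar", "z": "V_alveolar",
    "k": "V_velar", "g": "V_velar",
}

def visually_confusable(phoneme_a, phoneme_b):
    """Two phonemes look alike on the lips if they share a viseme class."""
    va = PHONEME_TO_VISEME.get(phoneme_a)
    vb = PHONEME_TO_VISEME.get(phoneme_b)
    return va is not None and va == vb
```

Under this mapping, `visually_confusable("p", "b")` holds while `visually_confusable("p", "t")` does not, which is exactly the ambiguity the citation statement describes.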
“…A viseme is the visual equivalent of the phoneme: a static image of a person articulating a phoneme ( Dong et al, 2003 ). There are some phonemes that share identical visemes ( Cappelletta and Harte, 2012 , Lucey et al, 2004 , Mahavidyalaya, 2014 ), but for the vowels of the syllables used in this study, the visemes are clearly distinguishable (see illustrations in Jachimski et al, 2018 ), which is of importance given that we present visual-only trials as well. The syllables were edited using Audacity (version 3.0.2) in order to be cut and adjusted to the same duration of 400 ms.…”
Section: Methods (mentioning)
confidence: 84%
“…The specific data is shown in Table 1. The new British English dataset contains 11 visemes [27], 8 hand shapes and 4 hand positions to encode 17 vowels and 24 consonants. RGB video images of the interpreter's upper body are available at 25 fps, and the spatial resolution is 720 × 1280.…”
Section: Dataset (mentioning)
confidence: 99%
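The counts in the statement above admit a quick coding-capacity check. In standard Cued Speech, hand shapes code consonant groups and hand positions code vowel groups, with lip visemes disambiguating within each group; that division of labour is an assumption here, since the cited dataset description only states the counts.

```python
hand_shapes, hand_positions, lip_visemes = 8, 4, 11
consonants, vowels = 24, 17

consonants_per_shape = consonants / hand_shapes   # 3.0 consonants share each hand shape
vowels_per_position = vowels / hand_positions     # 4.25 vowels share each hand position
manual_cues = hand_shapes * hand_positions        # 32 distinct shape-position combinations
```

So the manual channel alone (32 cues) cannot uniquely encode all 41 phonemes; the 11 lip visemes supply the remaining distinctions within each hand-cue group.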
“…Ours(multi) represents our model in the multi-speaker scenario. [Extraction-garbled results table comparing +S3, Ours-SANs, Ours, and Ours(multi); the numeric columns cannot be reliably reconstructed.]…”
Section: (unspecified) (mentioning)
confidence: 99%