This paper presents a new learning-based approach to speech animation synthesis that produces mouth movements with rich and expressive articulation for novel audio input. From a database of 3D triphone motions, our algorithm selects optimal sequences according to a triphone similarity measure and concatenates them into new utterances that preserve coarticulation effects. Using a Locally Linear Embedding (LLE) representation of feature points on 3D scans, we propose a model that defines a measure of similarity among visemes and a system of viseme categories, which in turn yield triphone substitution rules and a cost function. Moreover, we compute deformation vectors for several facial expressions, allowing expression variation to be blended smoothly into the speech animation. In an entirely data-driven approach, our automated procedure for defining viseme categories closely reproduces the groups of related visemes defined in the phonetics literature. The structure of our selection method is intrinsic to the nature of speech, and it generates a substitution table that can be reused as-is in other speech animation systems.
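To make the LLE-based similarity idea concrete, the sketch below embeds the 3D feature points of scanned visemes with Locally Linear Embedding and scores viseme similarity by distance in the embedded space; this is a minimal illustration, not the paper's implementation. The array shapes, the `viseme_similarity` function, and the use of scikit-learn's `LocallyLinearEmbedding` are all assumptions for demonstration.

```python
# Minimal sketch (not the authors' implementation): embed the 3D facial
# feature points of each viseme scan with LLE, then measure similarity
# between visemes in the low-dimensional space. Names such as
# `viseme_points` and `viseme_similarity` are hypothetical.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# One row per viseme scan; columns are flattened (x, y, z) coordinates
# of its facial feature points (here: 40 visemes, 25 points each).
rng = np.random.default_rng(0)
viseme_points = rng.random((40, 3 * 25))

# Project the high-dimensional feature-point data onto a low-dimensional
# manifold that preserves local neighborhood structure.
lle = LocallyLinearEmbedding(n_neighbors=8, n_components=3)
embedded = lle.fit_transform(viseme_points)

def viseme_similarity(i: int, j: int) -> float:
    """Similarity of two visemes: inverse of their LLE-space distance."""
    return 1.0 / (1.0 + np.linalg.norm(embedded[i] - embedded[j]))

# Visemes whose pairwise similarity exceeds a threshold could be grouped
# into one category, yielding substitution rules for triphone selection.
print(viseme_similarity(0, 1))
```

Under this kind of formulation, the substitution cost between two triphones could be derived from the category memberships and pairwise similarities of their constituent visemes.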