We describe how to create, using machine learning techniques, a generative, videorealistic speech animation module. A human subject is first recorded with a video camera as he/she utters a predetermined speech corpus. After the corpus is processed automatically, a visual speech module is learned from the data that is capable of synthesizing the human subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence containing natural head and eye movement. The final output is videorealistic in the sense that it looks like a video camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned.

The two key contributions of this paper are 1) a variant of the multidimensional morphable model (MMM) to synthesize new, previously unseen mouth configurations from a small set of mouth image prototypes; and 2) a trajectory synthesis technique based on regularization, which is automatically trained from the recorded video corpus, and which is capable of synthesizing trajectories in MMM space corresponding to any desired utterance.

Figure 1: Some of the synthetic facial configurations output by our system.

The input audio can be either real human audio (from the same subject or a different subject), or synthetic audio produced by a text-to-speech system. All that is required by our system is that the audio be phonetically transcribed and aligned. In the case of synthetic audio from TTS systems, this phonetic alignment is readily available from the TTS system itself [Black and Taylor 1997]. In the case of real audio, publicly available phonetic alignment systems [Huang et al. 1993] may be used.

Our visual speech processing system is composed of two modules. The first module is the multidimensional morphable model (MMM), which is capable of morphing between a small set of prototype mouth images to synthesize new, previously unseen mouth configurations. The second is a trajectory synthesis module, which uses regularization [Girosi et al. 1993] [Wahba 1990] to synthesize smooth trajectories in MMM space for any specified utterance. The parameters of the trajectory synthesis module are trained automatically from the recorded corpus using gradient descent learning.

Recording the video corpus takes on the order of 15 minutes. Processing of the corpus takes on the order of several days but, apart from the specification of head and eye masks shown in Figure 3, is fully automatic, requiring no intervention on the part of the user. The final visual speech synthesis module consists of a small set of prototype images (46 images in the case presented here) extracted from the recorded corpus and used to synthesize all novel sequences.

Application scenarios for videorealistic speech animation include: user-interface agents for desktops, TVs, or cell-phones; digital actors in movies; virtual avatars in ...
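To make the regularization-based trajectory synthesis concrete, here is a minimal, hypothetical 1-D sketch rather than the authors' implementation: it assumes each frame's target is the mean and variance of its aligned phoneme in a single MMM dimension, and it enforces smoothness with a second-difference penalty weighted by a regularization parameter lam. The function name synthesize_trajectory and the toy phoneme targets below are invented for illustration; the actual system operates on full phoneme covariances across all MMM dimensions and learns the regularizer from the corpus by gradient descent.

```python
import numpy as np

def synthesize_trajectory(phoneme_means, phoneme_vars, lam=1.0):
    """Hypothetical sketch of regularization-based trajectory synthesis.

    phoneme_means: (T,) per-frame target values in one MMM dimension
                   (each frame's target is the mean of its aligned phoneme).
    phoneme_vars:  (T,) per-frame target variances for the same phonemes.
    lam:           smoothness (regularization) weight.

    Minimizes  sum_t (y_t - mu_t)^2 / var_t  +  lam * ||D2 y||^2,
    where D2 is the second-difference operator, and returns the smooth
    trajectory y. This is a 1-D illustration only.
    """
    T = len(phoneme_means)
    # Data term: diagonal precision matrix (inverse variances).
    W = np.diag(1.0 / np.asarray(phoneme_vars, dtype=float))
    # Smoothness term: second-difference operator of shape (T-2, T).
    D2 = np.zeros((T - 2, T))
    for t in range(T - 2):
        D2[t, t:t + 3] = [1.0, -2.0, 1.0]
    # Closed-form minimizer of the quadratic objective.
    A = W + lam * D2.T @ D2
    b = W @ np.asarray(phoneme_means, dtype=float)
    return np.linalg.solve(A, b)

# Toy usage: a closure-like segment (low value, low variance) between
# two open-mouth segments; larger lam yields a smoother trajectory.
means = np.array([0.8] * 5 + [0.1] * 3 + [0.8] * 5)
varis = np.array([0.05] * 5 + [0.01] * 3 + [0.05] * 5)
print(synthesize_trajectory(means, varis, lam=5.0).round(2))
```

The phoneme variances act as per-frame confidences: frames aligned to phonemes with tight visual statistics pull the trajectory strongly toward their means, while the smoothness term fills in the transitions between them.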
We compared persons with dyslexia and normal readers with respect to how well they identified letters and short strings of letters briefly presented in the peripheral visual field at the same time that a single letter was presented at the fixation point of gaze. We found that the dyslexic subjects correctly identified letters over a markedly wider area of the peripheral field than did the normal readers. However, the dyslexic subjects showed "masking" between letters in the foveal field and letters in the near periphery. It appears that dyslexic persons learn to read outside the foveal field and, more generally, that there are different learned strategies for task-directed vision. Among such strategies are different mutual interactions between foveal and peripheral vision.
Italian children (n = 125) were classified into dyslexics, poor readers, and ordinary readers. The dyslexics were further classified into the Boder and Bakker subtypes. The children were tested with the form-resolving field (FRF), which measures central and peripheral visual recognition. Dyslexics showed higher correct identification of letters in the periphery, supporting the notion of a different distribution of lateral masking. A numerical characterization of individual FRFs (C2R) reliably distinguishes between dyslexics and ordinary readers. The wider distribution of recognition, similar across the various subtypes of dyslexia, suggests a general characteristic of visual perception and possibly a different visual-attentional mode.