Silent communication based on biosignals from facial muscles requires accurate detection of their directional movement, and thus optimal positioning of a minimum number of sensors, to achieve high speech recognition accuracy with minimal person-to-person variation. Previous approaches based on electromyogram or pressure sensors have been ineffective in detecting the directional movement of facial muscles. Therefore, in this study, high-performance strain sensors are used to detect x- and y-axis strain separately. Directional strain distribution data of the facial muscles are obtained by applying three-dimensional digital image correlation. Deep learning analysis is used to identify the optimal positions of the directional strain sensors. The recognition system, with four directional strain sensors conformably attached to the face, achieves silent vowel recognition with 85.24% accuracy, and 76.95% even for completely nonobserved subjects. These results show that detection of the directional strain distribution at the optimal facial points will be the key enabling technology for highly accurate silent speech recognition.
We propose Guided-TTS 2, a diffusion-based generative model for high-quality adaptive TTS using untranscribed data. Guided-TTS 2 combines a speaker-conditional diffusion model with a speaker-dependent phoneme classifier for adaptive text-to-speech. We train the speaker-conditional diffusion model on large-scale untranscribed datasets for a classifier-free guidance method and further fine-tune the diffusion model on the reference speech of the target speaker for adaptation, which takes only 40 seconds. We demonstrate that Guided-TTS 2 shows performance comparable to high-quality single-speaker TTS baselines in terms of speech quality and speaker similarity with only ten seconds of untranscribed data. We further show that Guided-TTS 2 outperforms adaptive TTS baselines on multi-speaker datasets even in a zero-shot adaptation setting. Guided-TTS 2 can adapt to a wide range of voices using only untranscribed speech, which enables adaptive TTS with the voice of non-human characters such as Gollum in "The Lord of the Rings".
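As a rough illustration of the classifier-free guidance idea mentioned above, the sketch below blends a speaker-conditional and an unconditional score estimate. It shows the generic form of the technique only. The abstract does not state the exact guidance rule used in Guided-TTS 2, and the function name, argument names, and `guidance_scale` parameter are hypothetical.

```python
import numpy as np

def classifier_free_guidance(score_cond, score_uncond, guidance_scale):
    """Blend conditional and unconditional score estimates (generic form).

    Assumption: this is the commonly used linear extrapolation, not
    necessarily the rule used by Guided-TTS 2.
    guidance_scale = 0 returns the conditional estimate unchanged;
    larger values push the result further toward the condition.
    """
    score_cond = np.asarray(score_cond, dtype=float)
    score_uncond = np.asarray(score_uncond, dtype=float)
    # Move away from the unconditional estimate, along the direction
    # that the speaker condition contributes.
    return score_cond + guidance_scale * (score_cond - score_uncond)
```

In a sampler, a function like this would be called once per denoising step, with both estimates produced by the same network run with and without the speaker embedding.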
of the unedited regions. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.