This paper investigates the use of hidden Markov models (HMMs) for Modern Standard Arabic speech synthesis. HMM-based speech synthesis systems require each speech unit to be described by a set of contextual features specifying its phonetic, phonological, and linguistic properties. To apply this method to the Arabic language, we studied its particularities to extract suitable contextual features. Two phenomena are highlighted: vowel quantity and gemination. This work focuses on how to model geminated consonants (resp. long vowels): either as fully-fledged phonemes, or as the same phonemes as their simple (resp. short) counterparts but with a different duration. Four modelling approaches are proposed for this purpose. Subjective and objective evaluations show no significant difference between keeping the modelling units of geminated consonants (resp. long vowels) separate from those of simple consonants (resp. short vowels) and merging them, as long as gemination and vowel-quantity information is included in the feature set.
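The two modelling choices can be made concrete with a minimal sketch in Python; the function and feature names below are illustrative assumptions, not the label format actually used in the paper.

# Sketch (not the authors' code): a geminated consonant such as /b:/ is
# either mapped to its own modelling unit, or shares the unit of simple /b/
# while gemination is carried as a contextual feature. The evaluations above
# suggest both variants behave similarly when the feature is retained.
def context_label(phone, geminated, merge_units=True):
    if merge_units:
        unit = phone                                  # shared unit, e.g. 'b'
    else:
        unit = (phone + ":") if geminated else phone  # separate unit, e.g. 'b:'
    features = {"gemination": int(geminated)}         # kept in both variants
    return unit, features

print(context_label("b", geminated=True, merge_units=True))   # ('b',  {'gemination': 1})
print(context_label("b", geminated=True, merge_units=False))  # ('b:', {'gemination': 1})

The same scheme applies to vowel quantity, with a long vowel either getting its own unit or sharing the short vowel's unit plus a quantity feature.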
In recent years, the performance of speech synthesis systems has improved thanks to deep learning-based models, but generating expressive audiovisual speech remains an open issue. Variational auto-encoders (VAEs) have recently been proposed to learn latent representations of data. In this paper, we present a system for expressive text-to-audiovisual speech synthesis that learns a latent embedding space of emotions using a conditional generative model based on the variational auto-encoder framework. When conditioned on textual input, the VAE learns an embedded representation that captures the emotion characteristics of the signal while remaining invariant to the phonetic content of the utterances. We applied this method in an unsupervised manner to generate the duration, acoustic, and visual features of speech. The conditional variational auto-encoder (CVAE) was also used to blend emotions: the model can generate nuances of a given emotion, or new emotions that do not exist in our database. We conducted three perceptual experiments to evaluate our findings.
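To make the conditioning idea concrete, here is a minimal CVAE sketch, assuming a PyTorch implementation; the dimensions, layer sizes, and names (x for acoustic/visual frames, c for the phonetic conditioning vector) are assumptions for illustration, not the paper's actual architecture.

import torch
import torch.nn as nn

class CVAE(nn.Module):
    # Encoder sees both data and condition; decoder reconstructs from (z, c),
    # so z is pushed to encode what c does not explain (here: emotion).
    def __init__(self, x_dim=80, c_dim=32, z_dim=16, h_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

def loss_fn(x, x_hat, mu, logvar):
    rec = ((x - x_hat) ** 2).sum(-1).mean()                               # reconstruction
    kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()   # KL to N(0, I)
    return rec + kld

Emotion blending then amounts to interpolating in the latent space, e.g. decoding z = 0.5 * z_emotion_a + 0.5 * z_emotion_b together with the phonetic condition c, which yields nuances or mixtures of the training emotions.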
This paper presents a bimodal acoustic-visual synthesis technique that concurrently generates the acoustic speech signal and a 3D animation of the speaker's outer face, by concatenating bimodal diphone units consisting of both acoustic and visual information. In the visual domain, we focus mainly on the dynamics of the face rather than on rendering. The proposed technique overcomes the asynchrony and incoherence problems inherent in classic approaches to audiovisual synthesis. The synthesis steps are similar to those of typical concatenative speech synthesis but are generalized to the acoustic-visual domain. The bimodal synthesis was assessed through perceptual and subjective evaluations; the overall outcome indicates that it provides intelligible speech in both the acoustic and visual channels.
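The following sketch illustrates why keeping the two channels inside one unit avoids asynchrony; the data structures, cost terms, and greedy search are assumptions for illustration and simplify the Viterbi-style search a full system would use.

import numpy as np

# Sketch (not the authors' code): each candidate diphone unit carries both
# acoustic and visual boundary vectors, so a single join cost is computed
# over the two channels jointly and they can never be selected independently.
def join_cost(u, v):
    acoustic = np.linalg.norm(u["acoustic_end"] - v["acoustic_start"])
    visual = np.linalg.norm(u["visual_end"] - v["visual_start"])
    return acoustic + visual  # one bimodal cost keeps the channels in sync

def select_units(candidates_per_diphone):
    # Greedy left-to-right selection over the diphone lattice.
    path = [min(candidates_per_diphone[0], key=lambda u: u.get("target_cost", 0.0))]
    for cands in candidates_per_diphone[1:]:
        path.append(min(cands, key=lambda v: join_cost(path[-1], v)))
    return path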