Expressive synthesis from text is a challenging problem for two reasons. First, read text, such as an audiobook, is often highly expressive in order to convey the emotions and scenarios in the text. Second, since expressive training speech is not always available for every speaker, methods are needed to share expressive information across speakers. This paper investigates the use of very expressive, highly diverse audiobook data from multiple speakers to build an expressive speech synthesis system. Both problems are addressed with a factorized framework in which speaker and emotion are modeled in separate sub-spaces of a cluster adaptive training (CAT) parametric speech synthesis system. The sub-spaces for the expressive state of a speaker and the characteristics of the speaker are jointly trained on a set of audiobooks. The expressive speech synthesis system operates in two distinct modes. In the first mode, the expressive information is given by audio data, and an adaptation method is used to extract that information from the audio. In the second mode, the input to the synthesis system is plain text, and a full expressive synthesis system is examined in which the expressive state is predicted from the text. In both modes, the expressive information is shared and transplanted across different speakers. Experimental results show that, in both modes, the proposed expressive speech synthesis method significantly improves the expressiveness of the synthetic speech for different speakers. Finally, this paper also examines whether the expressive states can be predicted from text for multiple speakers using a single model, or whether the prediction process needs to be speaker-specific.

Index Terms: Audiobook, cluster adaptive training, expressive speech synthesis, factorization, hidden Markov model, neural network.
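As a concrete illustration of the factorization, the sketch below shows how a CAT-style adapted mean is typically computed as a weighted sum of cluster means, with separate weight vectors for the speaker and expression sub-spaces. This is an assumed, minimal rendering of the general technique, not the authors' code; all dimensions, matrices, and weights are hypothetical.

```python
import numpy as np

# Hypothetical sketch of a factorized CAT mean computation: the adapted
# Gaussian mean is a weighted sum of cluster means, with separate weight
# vectors for the speaker sub-space and the expression sub-space.

rng = np.random.default_rng(0)

feat_dim = 40           # acoustic feature dimension (assumed)
n_speaker_clusters = 4  # clusters in the speaker sub-space (assumed)
n_expr_clusters = 3     # clusters in the expression sub-space (assumed)

# Cluster mean matrices (columns are cluster means), learned during training.
M_speaker = rng.standard_normal((feat_dim, n_speaker_clusters))
M_expr = rng.standard_normal((feat_dim, n_expr_clusters))

def adapted_mean(lambda_speaker, lambda_expr):
    """Combine speaker and expression cluster means into one adapted mean."""
    return M_speaker @ lambda_speaker + M_expr @ lambda_expr

# Speaker weights are estimated per speaker; expression weights are either
# estimated from adaptation audio (mode 1) or predicted from text (mode 2).
lam_spk = np.array([0.7, 0.1, 0.1, 0.1])
lam_expr = np.array([0.2, 0.5, 0.3])

mu = adapted_mean(lam_spk, lam_expr)
print(mu.shape)  # (40,)
```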
In this paper we investigate the use of noise-robust features characterizing the speech excitation signal as a complement to the vocal-tract-based features usually considered for Automatic Speech Recognition (ASR). The proposed Excitation-based Features (EBF) are tested in a state-of-the-art Deep Neural Network (DNN) based hybrid acoustic model for speech recognition. The suggested excitation features expand the set of periodicity features previously considered for ASR, with the expectation that they help better discriminate the broad phonetic classes (e.g., fricatives, nasals, vowels, etc.). Our experiments on the AMI meeting transcription system showed that the proposed EBF yield a relative word error rate reduction of about 5% when combined with conventional PLP features. Further experiments on Aurora4 confirmed the robustness of the EBF to both additive and convolutive noise, with a relative improvement of 4.3% obtained by combining them with mel filter banks.
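To make the feature combination concrete, the sketch below shows one common way to splice excitation features with spectral features frame by frame before feeding a hybrid DNN acoustic model. The feature dimensions, names, and the context-splicing step are assumptions for illustration and do not reproduce the paper's pipeline; the EBF extraction itself is not shown.

```python
import numpy as np

# Hypothetical sketch: appending excitation-based features (EBF) to
# conventional PLP features, frame by frame, before DNN training.

n_frames = 500
plp = np.random.randn(n_frames, 13)  # per-frame PLP features (assumed dim)
ebf = np.random.randn(n_frames, 5)   # per-frame excitation features (assumed dim)

# Frame-level concatenation: the DNN input is the joint feature vector.
combined = np.concatenate([plp, ebf], axis=1)  # shape (n_frames, 18)

# Typical hybrid-DNN context splicing: stack +/- 5 neighbouring frames.
context = 5
padded = np.pad(combined, ((context, context), (0, 0)), mode="edge")
spliced = np.stack(
    [padded[i : i + n_frames] for i in range(2 * context + 1)], axis=1
).reshape(n_frames, -1)  # shape (n_frames, 11 * 18)

print(spliced.shape)
```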
This work aims at bootstrapping acoustic model training for automatic speech recognition with small amounts of human-labeled speech data and large amounts of machine-labeled speech data. Semi-supervised learning is investigated to select the machine-transcribed training samples. Two semi-supervised learning methods are proposed: one is a local-global uncertainty based method, which introduces both the local uncertainty from the current utterance and the global uncertainty from the whole data pool into the data selection; the other is margin based data selection, which selects utterances near the decision boundary through language model tuning. Experimental results on a Japanese far-field automatic speech recognition system indicate that the acoustic model trained on automatically transcribed speech data achieves about a 17% relative gain when no in-domain human-annotated data was available for initialization, and a 3.7% relative gain when the initial acoustic model was trained on a small amount of in-domain data.
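The sketch below illustrates one plausible form of local-global uncertainty based selection: utterances are ranked by combining their own (local) uncertainty with their uncertainty relative to the whole pool (global). The specific scoring formula (local uncertainty as one minus confidence, global uncertainty as deviation from the pool mean) and all parameters are assumptions, not the paper's exact criterion.

```python
import numpy as np

# Hypothetical sketch of uncertainty-based selection of machine-transcribed
# utterances. Scores, weights, and the budget are illustrative assumptions.

def select_utterances(confidences, budget, alpha=0.5):
    """Rank utterances by a combined local/global uncertainty score.

    confidences: per-utterance decoder confidence scores in [0, 1].
    budget: number of utterances to keep for acoustic model training.
    alpha: weight between the local and global uncertainty terms (assumed).
    """
    confidences = np.asarray(confidences)
    local = 1.0 - confidences                            # per-utterance uncertainty
    global_ = np.abs(confidences - confidences.mean())   # deviation from the pool
    score = alpha * local + (1.0 - alpha) * global_
    # Keep the highest-scoring (most informative) utterances.
    return np.argsort(score)[::-1][:budget]

pool_conf = np.random.uniform(0.3, 1.0, size=10_000)
chosen = select_utterances(pool_conf, budget=2_000)
print(len(chosen), chosen[:5])
```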