2018 | DOI: 10.1016/j.specom.2018.03.002
Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis

Abstract: In this paper, we investigate the simultaneous modeling of multiple emotions in DNN-based expressive speech synthesis, and how to represent the emotional labels, such as emotional class and strength, for this task. Our goal is to answer two questions: First, what is the best way to annotate speech data with multiple emotions: should we use the labels that the speaker intended to express, or labels based on listener perception of the resulting speech signals? Second, how should the emotional information be repre…

Cited by 77 publications (58 citation statements)
References 26 publications
“…It has been shown that only ∼5 min of speech per style is sufficient to produce speech of acceptable quality in a specific style. The use of input codes to represent different styles is also presented in [119, 120]. There have also been attempts at style transplantation, i.e., producing speech in the voice of speaker A in style X without any sentence from speaker A in style X in the training data, in which case the network is forced to learn style X from other speakers in the training database [121, 122].…”
Section: Progress In Speech Recognition And Synthesis (mentioning)
confidence: 99%
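The input-code approach in the statement above amounts to feeding one-hot speaker and style codes alongside the linguistic features of a shared acoustic model, which is also what makes style transplantation possible for a (speaker, style) pair unseen in training. Below is a minimal sketch of that idea in PyTorch; all dimensions, layer sizes, and class names are illustrative assumptions, not the cited systems' exact recipe.

```python
# Sketch only: a feed-forward acoustic model conditioned on one-hot speaker
# and style codes. Dimensions and names are assumptions for illustration.
import torch
import torch.nn as nn

class CodeConditionedAcousticModel(nn.Module):
    def __init__(self, linguistic_dim=300, num_speakers=10, num_styles=4,
                 acoustic_dim=187):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(linguistic_dim + num_speakers + num_styles, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, acoustic_dim),
        )

    def forward(self, linguistic_feats, speaker_code, style_code):
        # Tile the speaker and style codes across all frames and concatenate
        # them with the frame-level linguistic features.
        n = linguistic_feats.size(0)
        cond = torch.cat([speaker_code, style_code], dim=-1).expand(n, -1)
        return self.net(torch.cat([linguistic_feats, cond], dim=-1))

model = CodeConditionedAcousticModel()
feats = torch.randn(120, 300)                        # 120 frames of one utterance
speaker_a = torch.zeros(1, 10); speaker_a[0, 0] = 1  # one-hot code for speaker A
style_x = torch.tensor([[0., 0., 1., 0.]])           # style X, possibly unseen for A
acoustic = model(feats, speaker_a, style_x)          # (120, 187) acoustic parameters
```

Because the speaker and style codes are independent inputs, the same network can, in principle, be asked for a combination it never saw in training, which is the transplantation setting described in the quoted statement.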
“…The proposed speaker representation learning algorithms extend these ideas to make DNNs learn the pairwise perceptual similarity between speakers rather than the conventional pointwise impression of a single speaker's voice. Furthermore, one can model the relationship between a speaker's intention and a listener's perception (e.g., differences in emotion perception [37]) using these algorithms. Also, the proposed speaker embeddings can be used in more sophisticated speech synthesis frameworks, such as end-to-end multi-speaker TTS [21], multi-speaker multi-lingual TTS [38], and singing VC [39], instead of the conventional discriminative speaker embeddings.…”
Section: E. Discussion (mentioning)
confidence: 99%
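The pairwise idea in the statement above can be made concrete as a small regression objective: a speaker encoder produces embeddings, and the cosine similarity of two speakers' embeddings is pushed toward the listener-rated similarity of that pair. The sketch below is an assumed, simplified PyTorch formulation; the encoder architecture, dimensions, and the name perceptual_similarity_loss are illustrative, not the cited paper's implementation.

```python
# Sketch only: train speaker embeddings so that their cosine similarity matches
# listener-rated pairwise similarity, instead of only classifying speaker identity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=80, emb_dim=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, 128, batch_first=True)
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, mels):            # mels: (batch, frames, feat_dim)
        _, h = self.gru(mels)           # h: (num_layers, batch, hidden)
        return self.proj(h[-1])         # (batch, emb_dim)

def perceptual_similarity_loss(emb_a, emb_b, rated_similarity):
    # Regress the embeddings' cosine similarity toward the human-rated
    # similarity score for this speaker pair (assumed to lie in [-1, 1]).
    cos = F.cosine_similarity(emb_a, emb_b, dim=-1)
    return F.mse_loss(cos, rated_similarity)

encoder = SpeakerEncoder()
mels_a, mels_b = torch.randn(8, 200, 80), torch.randn(8, 200, 80)
rated = torch.rand(8) * 2 - 1            # placeholder listener scores
loss = perceptual_similarity_loss(encoder(mels_a), encoder(mels_b), rated)
```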
“…Many pioneering methods have been proposed for emotional TTS. [4] proposes an LSTM-based acoustic model for emotional TTS, where several kinds of emotion category labels, such as a one-hot vector or a perception vector, are used as an extra input to the acoustic model. [5] uses an improved Tacotron [1] model for end-to-end emotional TTS, in which the emotion labels are concatenated to the outputs of both the decoder pre-net and the first decoder RNN layer.…”
Section: Introduction (mentioning)
confidence: 99%
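As a concrete illustration of the LSTM-based approach described above, the sketch below appends an emotion vector, either a hard one-hot class label or a soft "perception" vector of listener ratings, to every linguistic input frame of an LSTM acoustic model. All dimensions and names are assumptions for illustration, not the cited systems' exact configurations.

```python
# Sketch only: an LSTM acoustic model that takes an emotion vector as an extra
# per-frame input. The same interface accepts one-hot or soft perception vectors.
import torch
import torch.nn as nn

class EmotionalLSTMAcousticModel(nn.Module):
    def __init__(self, linguistic_dim=300, emotion_dim=4, acoustic_dim=187):
        super().__init__()
        self.lstm = nn.LSTM(linguistic_dim + emotion_dim, 256,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(256, acoustic_dim)

    def forward(self, linguistic_feats, emotion_vec):
        # linguistic_feats: (batch, frames, linguistic_dim)
        # emotion_vec: (batch, emotion_dim), tiled across all frames
        frames = linguistic_feats.size(1)
        emo = emotion_vec.unsqueeze(1).expand(-1, frames, -1)
        x = torch.cat([linguistic_feats, emo], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)

model = EmotionalLSTMAcousticModel()
feats = torch.randn(2, 150, 300)
labels = torch.tensor([[1., 0., 0., 0.],        # hard one-hot class label
                       [0.2, 0.6, 0.1, 0.1]])   # soft perception vector
acoustic = model(feats, labels)                 # (2, 150, 187)
```

The only difference between the two labeling schemes at the model interface is whether the conditioning vector is binary or continuous, which is precisely the kind of representation choice the indexed paper investigates.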