2019
DOI: 10.48550/arxiv.1912.06813
Preprint

Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining

Abstract: We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Non…
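As a rough illustration of the idea in the abstract, the sketch below shows a minimal Transformer seq2seq model that maps source mel-spectrogram frames to target frames under teacher forcing. It is a hedged sketch, not the paper's architecture: the class name, layer sizes, and plain linear prenets/postnet are assumptions, and positional encodings and autoregressive inference are omitted for brevity.

```python
import torch
import torch.nn as nn

class TransformerVC(nn.Module):
    """Minimal Transformer seq2seq VC sketch: source mels -> target mels.

    Illustrative only; hyperparameters and prenets are placeholders,
    and positional encodings are omitted for brevity.
    """

    def __init__(self, n_mels=80, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.src_prenet = nn.Linear(n_mels, d_model)  # encoder input projection
        self.tgt_prenet = nn.Linear(n_mels, d_model)  # decoder input projection
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.postnet = nn.Linear(d_model, n_mels)     # project back to mel bins

    def forward(self, src_mel, tgt_mel):
        # Causal mask: each decoder frame attends only to earlier target frames.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_mel.size(1))
        hidden = self.transformer(
            self.src_prenet(src_mel), self.tgt_prenet(tgt_mel), tgt_mask=tgt_mask
        )
        return self.postnet(hidden)

# Toy usage: 2 utterances, 100 source frames converted to 120 target frames.
model = TransformerVC()
src = torch.randn(2, 100, 80)
tgt = torch.randn(2, 120, 80)
pred = model(src, tgt)            # teacher-forced output, shape (2, 120, 80)
loss = nn.L1Loss()(pred, tgt)     # a typical L1 reconstruction loss
```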

Cited by 26 publications (40 citation statements)
References 36 publications (72 reference statements)
“…Studies also show that voice conversion benefits from knowledge about the linguistic content of speech. For example, speaker voice conversion successfully leverages TTS [132,20,133] or ASR systems [134,135] that are phonetically informed and trained on large speech corpora.…”
Section: Leveraging TTS or ASR Systems (mentioning)
confidence: 99%
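As a hedged sketch of what "leveraging a TTS system" can look like in practice, one common recipe is to initialize the VC model with decoder weights from a TTS model pretrained on a large corpus, then fine-tune on the small VC dataset. The checkpoint path and module names below are hypothetical, and this is only one plausible transfer scheme, not the exact procedure of [132,20,133]:

```python
import torch

# Hypothetical checkpoint of a Transformer TTS model trained on a large corpus.
tts_state = torch.load("pretrained_tts.pt", map_location="cpu")

vc_model = TransformerVC()  # the sketch class defined above

# Copy the phonetically informed TTS decoder weights into the VC decoder;
# strict=False tolerates keys present on only one side (e.g. the text encoder).
decoder_state = {k.removeprefix("decoder."): v
                 for k, v in tts_state.items() if k.startswith("decoder.")}
vc_model.transformer.decoder.load_state_dict(decoder_state, strict=False)

# Fine-tune all parameters on the (small) parallel VC corpus.
optimizer = torch.optim.Adam(vc_model.parameters(), lr=1e-4)
```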
“…Earlier studies of voice conversion focused on modeling the mapping between source and target features with statistical methods, including the Gaussian mixture model (GMM) [9], partial least squares regression [10], frequency warping [11], and sparse representation [12,13,14]. Deep learning approaches, such as deep neural networks (DNNs) [15,16], recurrent neural networks (RNNs) [17], generative adversarial networks (GANs) [18], and sequence-to-sequence models with attention mechanisms [19,20], have advanced the state of the art. In general, effective modeling requires parallel training data.…”
Section: Introduction (mentioning)
confidence: 99%
“…This toolkit provided Chainer [16]- and PyTorch [17]-based neural network libraries and highly reproducible recipes. ESPnet-TTS also contributed to many research projects and development platforms for new applications such as voice conversion [18], [19]. However, since the toolkit required a fair amount of offline processing, such as feature extraction and text frontend processing, there was room for improvement in scalability, flexibility, and portability.…”
Section: Introduction (mentioning)
confidence: 99%
“…It solves the problem of training a voice conversion model when the dataset is insufficient. In addition, [22,44,45] combine the TTS model with the voice conversion model to address the difficulty of training a voice conversion model when the amount of training data is insufficient. [90,91,54,27] make the voice conversion module and the TTS module share the decoder to improve the voice conversion model's performance.…”
Section: Introduction (mentioning)
confidence: 99%
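To make the shared-decoder idea concrete, here is a hedged sketch (all names and sizes are assumptions, not the designs of [90,91,54,27]): the TTS branch encodes text and the VC branch encodes source speech, but both condition one common spectrogram decoder, so TTS training data also updates the parameters the conversion path relies on.

```python
import torch
import torch.nn as nn

class SharedDecoderTTSVC(nn.Module):
    """Sketch: separate TTS/VC encoders feeding one shared mel decoder."""

    def __init__(self, vocab_size=100, n_mels=80, d_model=256, nhead=4, layers=2):
        super().__init__()
        def make_encoder():
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        self.text_emb = nn.Embedding(vocab_size, d_model)   # TTS branch input
        self.text_encoder = make_encoder()
        self.mel_proj = nn.Linear(n_mels, d_model)          # VC branch input
        self.speech_encoder = make_encoder()
        self.tgt_prenet = nn.Linear(n_mels, d_model)        # decoder input
        # One decoder serves both tasks, so abundant TTS data also trains
        # the parameters used by the VC path.
        self.shared_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, inputs, tgt_mel, task="vc"):
        if task == "tts":
            memory = self.text_encoder(self.text_emb(inputs))    # inputs: token ids
        else:
            memory = self.speech_encoder(self.mel_proj(inputs))  # inputs: source mels
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_mel.size(1))
        hidden = self.shared_decoder(self.tgt_prenet(tgt_mel), memory, tgt_mask=tgt_mask)
        return self.out(hidden)
```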
“…These works clearly contribute to easing the training of voice conversion models with insufficient training data and to improving voice conversion models' performance. However, these methods still have some problems: (1) training some models still requires parallel datasets [90,22]; (2) some methods can only achieve one-to-one or many-to-one voice conversion [90,91,44,45]; (3) the joint training method affects the performance of TTS [90]; (4) reference audio is needed at the synthesis stage [91,27]. These problems make multi-task speech synthesis difficult.…”
Section: Introduction (mentioning)
confidence: 99%