2019
DOI: 10.48550/arxiv.1912.06813
Preprint

Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining

Abstract: We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Non…
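As a rough illustration of the idea in the abstract, the sketch below shows a minimal Transformer seq2seq model that maps source mel-spectrogram frames to target frames under teacher forcing. It is a hedged sketch, not the paper's architecture: the class name, layer sizes, and plain linear prenets/postnet are assumptions, and positional encodings and autoregressive inference are omitted for brevity.

```python
import torch
import torch.nn as nn

class TransformerVC(nn.Module):
    """Minimal Transformer seq2seq VC sketch: source mels -> target mels.

    Illustrative only; hyperparameters and prenets are placeholders,
    and positional encodings are omitted for brevity.
    """

    def __init__(self, n_mels=80, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.src_prenet = nn.Linear(n_mels, d_model)  # encoder input projection
        self.tgt_prenet = nn.Linear(n_mels, d_model)  # decoder input projection
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.postnet = nn.Linear(d_model, n_mels)     # project back to mel bins

    def forward(self, src_mel, tgt_mel):
        # Causal mask: each decoder frame attends only to earlier target frames.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_mel.size(1))
        hidden = self.transformer(
            self.src_prenet(src_mel), self.tgt_prenet(tgt_mel), tgt_mask=tgt_mask
        )
        return self.postnet(hidden)

# Toy usage: 2 utterances, 100 source frames converted to 120 target frames.
model = TransformerVC()
src = torch.randn(2, 100, 80)
tgt = torch.randn(2, 120, 80)
pred = model(src, tgt)            # teacher-forced output, shape (2, 120, 80)
loss = nn.L1Loss()(pred, tgt)     # a typical L1 reconstruction loss
```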

Cited by 26 publications (40 citation statements)
References 36 publications (72 reference statements)
“…Studies also show that voice conversion benefits from knowledge about the linguistic content of speech. For example, speaker voice conversion successfully leverages TTS [132,20,133] or ASR systems [134,135] that are phonetically informed and trained on large speech corpora.…”
Section: Leveraging TTS or ASR Systems (mentioning)
confidence: 99%
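As a hedged sketch of what "leveraging a TTS system" can look like in practice, one common recipe is to initialize the VC model with decoder weights from a TTS model pretrained on a large corpus, then fine-tune on the small VC dataset. The checkpoint path and module names below are hypothetical, and this is only one plausible transfer scheme, not the exact procedure of [132,20,133]:

```python
import torch

# Hypothetical checkpoint of a Transformer TTS model trained on a large corpus.
tts_state = torch.load("pretrained_tts.pt", map_location="cpu")

vc_model = TransformerVC()  # the sketch class defined above

# Copy the phonetically informed TTS decoder weights into the VC decoder;
# strict=False tolerates keys present on only one side (e.g. the text encoder).
decoder_state = {k.removeprefix("decoder."): v
                 for k, v in tts_state.items() if k.startswith("decoder.")}
vc_model.transformer.decoder.load_state_dict(decoder_state, strict=False)

# Fine-tune all parameters on the (small) parallel VC corpus.
optimizer = torch.optim.Adam(vc_model.parameters(), lr=1e-4)
```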
“…Earlier studies of voice conversion focused on modeling the mapping between source and target features with statistical methods, including the Gaussian mixture model (GMM) [9], partial least squares regression [10], frequency warping [11], and sparse representation [12,13,14]. Deep learning approaches, such as deep neural networks (DNNs) [15,16], recurrent neural networks (RNNs) [17], generative adversarial networks (GANs) [18], and sequence-to-sequence models with attention mechanisms [19,20], have advanced the state of the art. In general, effective modeling requires parallel training data.…”
Section: Introduction (mentioning)
confidence: 99%
“…This toolkit provided Chainer [16]- and PyTorch [17]-based neural network libraries and highly reproducible recipes. ESPnet-TTS also contributed to many research projects and development platforms for new applications such as voice conversion [18], [19]. However, since the toolkit required a fair amount of offline processing, such as feature extraction and text frontend processing, there was room for improvement in scalability, flexibility, and portability.…”
Section: Introduction (mentioning)
confidence: 99%
“…It solves the problem of training a voice conversion model when the dataset is insufficient. In addition, [22,44,45] combine the TTS model with the voice conversion model to address the difficulty of training a voice conversion model when the amount of training data is insufficient. [90,91,54,27] make the voice conversion module and the TTS module share the decoder to improve the voice conversion model's performance.…”
Section: Introduction (mentioning)
confidence: 99%
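To make the shared-decoder idea concrete, here is a hedged sketch (all names and sizes are assumptions, not the designs of [90,91,54,27]): the TTS branch encodes text and the VC branch encodes source speech, but both condition one common spectrogram decoder, so TTS training data also updates the parameters the conversion path relies on.

```python
import torch
import torch.nn as nn

class SharedDecoderTTSVC(nn.Module):
    """Sketch: separate TTS/VC encoders feeding one shared mel decoder."""

    def __init__(self, vocab_size=100, n_mels=80, d_model=256, nhead=4, layers=2):
        super().__init__()
        def make_encoder():
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        self.text_emb = nn.Embedding(vocab_size, d_model)   # TTS branch input
        self.text_encoder = make_encoder()
        self.mel_proj = nn.Linear(n_mels, d_model)          # VC branch input
        self.speech_encoder = make_encoder()
        self.tgt_prenet = nn.Linear(n_mels, d_model)        # decoder input
        # One decoder serves both tasks, so abundant TTS data also trains
        # the parameters used by the VC path.
        self.shared_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, inputs, tgt_mel, task="vc"):
        if task == "tts":
            memory = self.text_encoder(self.text_emb(inputs))    # inputs: token ids
        else:
            memory = self.speech_encoder(self.mel_proj(inputs))  # inputs: source mels
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_mel.size(1))
        hidden = self.shared_decoder(self.tgt_prenet(tgt_mel), memory, tgt_mask=tgt_mask)
        return self.out(hidden)
```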
“…These works clearly contribute to easing the training of voice conversion models with insufficient training data and to improving voice conversion models' performance. However, these methods still have some problems: (1) training some models still requires parallel datasets [90,22]; (2) some methods can only achieve one-to-one or many-to-one voice conversion [90,91,44,45]; (3) the joint training method affects the performance of TTS [90]; (4) reference audio is needed at the synthesis stage [91,27]. These problems make multi-task speech synthesis difficult.…”
Section: Introduction (mentioning)
confidence: 99%