2020
DOI: 10.1109/taslp.2019.2960721

Non-Parallel Sequence-to-Sequence Voice Conversion With Disentangled Linguistic and Speaker Representations

Abstract: This paper presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion is achieved by preserving the linguistic representations of source utterances while replacing the speaker representations with the target ones. Our model is built under the framework of encoder-decoder neural networks. A recognition encoder is designed to learn the disentang…
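The encoder-decoder factorization the abstract describes can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' code: a recognition encoder extracts a linguistic representation from the source utterance, a speaker encoder extracts a speaker representation from a target-speaker reference utterance, and a decoder resynthesizes acoustic features from the two. For brevity it collapses the paper's attention-based seq2seq decoder into a frame-synchronous GRU, so all module names and dimensions here are assumptions.

```python
import torch
import torch.nn as nn


class Seq2SeqVCSketch(nn.Module):
    """Hypothetical sketch of voice conversion via disentangled representations."""

    def __init__(self, feat_dim=80, ling_dim=256, spk_dim=128):
        super().__init__()
        # Recognition encoder: acoustic features -> linguistic representation
        self.recognition_encoder = nn.GRU(feat_dim, ling_dim, batch_first=True)
        # Speaker encoder: acoustic features -> utterance-level speaker embedding
        self.speaker_encoder = nn.GRU(feat_dim, spk_dim, batch_first=True)
        # Decoder: (linguistic, speaker) -> converted acoustic features
        self.decoder = nn.GRU(ling_dim + spk_dim, feat_dim, batch_first=True)

    def forward(self, src_feats, ref_feats):
        # Linguistic content comes from the *source* utterance ...
        ling, _ = self.recognition_encoder(src_feats)      # (B, T, ling_dim)
        # ... while speaker identity comes from a *target* reference utterance
        _, spk_state = self.speaker_encoder(ref_feats)     # (1, B, spk_dim)
        spk = spk_state[-1].unsqueeze(1).expand(-1, ling.size(1), -1)
        # Decode source content conditioned on the target speaker representation
        converted, _ = self.decoder(torch.cat([ling, spk], dim=-1))
        return converted


# Toy usage: convert a source utterance (80-dim mel features) to a target voice
model = Seq2SeqVCSketch()
src = torch.randn(1, 120, 80)   # source-speaker utterance
ref = torch.randn(1, 90, 80)    # target-speaker reference utterance
out = model(src, ref)           # -> (1, 120, 80) converted features
```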

Cited by 97 publications (97 citation statements)
References 48 publications (68 reference statements)
Citation types: 2 supporting, 95 mentioning, 0 contrasting

Citation statements (ordered by relevance):
“…Meanwhile, given the recent success of the sequence-to-sequence (S2S) learning framework in various tasks, several VC methods based on S2S models have been proposed, including the ones we proposed previously [51]-[54]. While S2S models usually require parallel corpora for training, an attempt has also been made to train an S2S model using non-parallel utterances [55]. However, it requires phoneme transcriptions as auxiliary information for model training.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…From this challenge, we observed that new speech waveform generation paradigms such as WaveNet and phone encoding have brought significant progress to the voice conversion field. Further improvements have been achieved in follow-up papers [295], [296], and new VC systems that exceed the challenge's best performance have already been reported.…”
Section: Overview of the 2018 Voice Conversion Challenge (citation type: mentioning)
confidence: 98%
“…The proposed framework improves the stability of the generated speech but does not add any benefits in terms of data efficiency, as parallel speech data of the source and target speakers is still required. Recently, Jing-Xuan Zhang et al. [18] introduced a non-parallel sequence-to-sequence voice conversion system whose procedure is similar to ours: the initial model is trained with a TTS-like network to help disentangle the linguistic representation, and the model is then adapted to the source and target speakers. However, the adaptation step of the framework proposed in [18] requires both source and target speaker speech as well as their transcripts, which increases the data demand for building a VC system with a particular voice.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
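To make the two-stage recipe in the statement above concrete (pre-train with text supervision, then adapt on the speakers of interest), here is a minimal, self-contained sketch. It is an assumption-laden illustration, not the actual procedure of [18] or of the cited paper: frame-level phone targets stand in for the TTS-style text supervision, and adaptation is simplified to plain feature reconstruction, whereas [18] also uses transcripts at adaptation time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.GRU(80, 256, batch_first=True)   # acoustic features -> linguistic
phone_head = nn.Linear(256, 40)               # 40 hypothetical phone classes
decoder = nn.GRU(256, 80, batch_first=True)   # linguistic -> acoustic features
params = (list(encoder.parameters()) + list(phone_head.parameters())
          + list(decoder.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

# Stage 1: multi-speaker pre-training with transcripts. Frame-level phone
# targets stand in here for the TTS-style text supervision.
feats = torch.randn(4, 100, 80)               # toy batch of mel features
phones = torch.randint(0, 40, (4, 100))       # toy frame-level phone labels
ling, _ = encoder(feats)
loss = F.cross_entropy(phone_head(ling).transpose(1, 2), phones)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: adaptation on untranscribed source/target utterances by plain
# feature reconstruction (no parallel data is required in this sketch).
adapt_feats = torch.randn(4, 100, 80)
ling, _ = encoder(adapt_feats)
recon, _ = decoder(ling)
loss = F.l1_loss(recon, adapt_feats)
opt.zero_grad(); loss.backward(); opt.step()
```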
“…For both parallel and non-parallel VC, the systems usually change the voice but are unable to change the duration of the utterance. Much recent research has focused on converting the speaking rate along with the voice by using sequence-to-sequence models [11,16,17,18], since speaking rate is also a speaker characteristic.…”
Section: Introduction (citation type: mentioning)
confidence: 99%