Interspeech 2017
DOI: 10.21437/interspeech.2017-247
Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Abstract: Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality such as phonetic property and speaking rate contained in the posterior probabilities because the source posterior pro…

Cited by 56 publications (46 citation statements)
References 14 publications
“…Note that here the output of our model is still of the same length as the input. Although sequence-to-sequence models, which can generate output sequences of variable length, have been successfully applied to VC [20,21,22,23,24], we will show that considering temporal dependencies alone can bring significant improvements to VAE-VC.…”
Section: Modeling Time Dependencies With the FCN Structure
confidence: 92%
“…Recently, deep learning has changed the standard voice conversion procedures described above, and many different solutions are now available. For instance, variational auto-encoders and sequence-to-sequence neural networks enable us to build VC systems without frame-level alignment [102,103]. It has also been shown that a cycle-consistent adversarial network called "CycleGAN" [104] is one possible solution for building VC systems without a parallel corpus.…”
Section: Voice Conversion
confidence: 99%
“…VC aims to convert the non-linguistic information of a speech signal while keeping the linguistic content unchanged. The non-linguistic information may refer to speaker identity [1,2,3] or accent and pronunciation [4,5], to name a few. VC can be useful in downstream tasks such as multi-speaker text-to-speech [6,7] and expressive speech synthesis [8,9], as well as in applications such as speech enhancement [10,11,12] and pronunciation correction [4].…”
Section: Introduction
confidence: 99%