2019
DOI: 10.1109/taslp.2019.2892235

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

citations
Cited by 136 publications
(117 citation statements)
references
References 32 publications
“…Mel-cepstrum distortion (MCD), root of mean square errors of F0 (F0 RMSE), the error rate of voicing/unvoicing flags (VUV) and the Pearson correlation factor of F0 (F0 CORR) were used as the metrics for objective evaluation. In order to investigate the effects of duration modification, we also computed the average absolute differences between the durations of the converted and target utterances (DDUR) as in our previous work [18]. When computing DDUR, the silence segments at the beginning and the end of utterances were removed.…”
Section: Objective Evaluations (mentioning)
confidence: 99%
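
The metrics named in the statement above can be illustrated with a minimal Python sketch. It is not taken from the cited paper. It assumes the converted and target frames are already time-aligned, that the 0th cepstral coefficient has been dropped, that an F0 value of 0 marks an unvoiced frame, and that MCD uses the commonly quoted dB formula. All function names are hypothetical.

```python
import math

def mcd_db(target_frames, converted_frames):
    """Mean mel-cepstral distortion in dB over aligned frames.

    Assumes each frame is a list of mel-cepstral coefficients with the
    0th (energy) coefficient already removed.
    """
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for t, c in zip(target_frames, converted_frames):
        squared = sum((a - b) ** 2 for a, b in zip(t, c))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(target_frames)

def f0_metrics(target_f0, converted_f0):
    """Return (F0 RMSE, VUV error rate, F0 Pearson correlation).

    Assumes F0 == 0 marks an unvoiced frame. RMSE and correlation use
    only frames voiced in both sequences.
    """
    pairs = list(zip(target_f0, converted_f0))
    vuv = sum((t > 0) != (c > 0) for t, c in pairs) / len(pairs)
    voiced = [(t, c) for t, c in pairs if t > 0 and c > 0]
    rmse = math.sqrt(sum((t - c) ** 2 for t, c in voiced) / len(voiced))
    mean_t = sum(t for t, _ in voiced) / len(voiced)
    mean_c = sum(c for _, c in voiced) / len(voiced)
    cov = sum((t - mean_t) * (c - mean_c) for t, c in voiced)
    var_t = sum((t - mean_t) ** 2 for t, _ in voiced)
    var_c = sum((c - mean_c) ** 2 for _, c in voiced)
    corr = cov / math.sqrt(var_t * var_c)
    return rmse, vuv, corr

def ddur(target_durations, converted_durations):
    """Average absolute duration difference between utterance pairs.

    Assumes durations were measured after trimming leading and
    trailing silence.
    """
    diffs = [abs(t - c) for t, c in zip(target_durations, converted_durations)]
    return sum(diffs) / len(diffs)
```

Calling `f0_metrics(target_f0, converted_f0)` on two equal-length F0 tracks returns the three F0 metrics as a tuple, while `ddur` takes one duration per utterance for each side.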
“…The forms of the acoustic models for VC included joint density Gaussian mixture models (JD-GMMs) [3], [7], [8], deep neural networks (DNNs) [9]-[11], recurrent neural networks (RNNs) [12], [13], and so on. Recently, sequence-to-sequence (seq2seq) neural networks [14]-[17] have also been applied to VC, which achieved higher naturalness and similarity than conventional frame-aligned conversion [18]-[20].…”
Section: Introduction (mentioning)
confidence: 99%
“…In contrast to S2ST, the input-output alignment for voice conversion is simpler and approximately monotonic. [23] also trains models that are specific to each input-output speaker pair (i.e. one-to-one conversion), whereas we explore many-to-one and many-to-many speaker configurations.…”
Section: Introduction (mentioning)
confidence: 99%
“…Note that here the output of our model is still of the same length as the input. Although sequence to sequence based models, which can generate output sequences of variable length, have been successfully applied to VC [20,21,22,23,24], we will show that only considering temporal dependencies can bring significant improvements to VAE-VC.…”
Section: Modeling Time Dependencies With the FCN Structure (mentioning)
confidence: 93%