Sequence-to-Sequence Acoustic Modeling for Voice Conversion

Zhang, Jing-Xuan; Ling, Zhen-Hua; Liu, Lijuan; Jiang, Yuan; Dai, Li-Rong

doi:10.1109/taslp.2019.2892235

Cited by 136 publications

(117 citation statements)

References 32 publications

Citation statements by type: Supporting, Mentioning (117), Contrasting

“…Mel-cepstrum distortion (MCD), root of mean square errors of F0 (F0 RMSE), the error rate of voicing/unvoicing flags (VUV) and the Pearson correlation factor of F0 (F0 CORR) were used as the metrics for objective evaluation. In order to investigate the effects of duration modification, we also computed the average absolute differences between the durations of the converted and target utterances (DDUR) as in our previous work [18]. When computing DDUR, the silence segments at the beginning and the end of utterances were removed.…”

Section: Objective Evaluations (mentioning)

confidence: 99%
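
The objective metrics named in the statement above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' evaluation code: it assumes the converted and target feature sequences have already been time-aligned frame by frame, that mel-cepstra exclude the energy coefficient, that an F0 value of 0 marks an unvoiced frame, and that durations were measured after trimming leading and trailing silence. The function names are hypothetical.

```python
import numpy as np


def mel_cepstral_distortion(mcep_conv, mcep_tgt):
    """Mean MCD in dB over aligned frames; inputs have shape (frames, dims)."""
    diff = np.asarray(mcep_conv, dtype=float) - np.asarray(mcep_tgt, dtype=float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


def f0_metrics(f0_conv, f0_tgt):
    """Return (F0 RMSE, VUV error rate, F0 Pearson correlation).

    Assumption: F0 == 0 marks an unvoiced frame. RMSE and correlation are
    computed only over frames that are voiced in both sequences.
    """
    f0_conv = np.asarray(f0_conv, dtype=float)
    f0_tgt = np.asarray(f0_tgt, dtype=float)
    voiced_conv = f0_conv > 0
    voiced_tgt = f0_tgt > 0
    # Fraction of frames whose voiced/unvoiced flag disagrees.
    vuv_error = float(np.mean(voiced_conv != voiced_tgt))
    both = voiced_conv & voiced_tgt
    rmse = float(np.sqrt(np.mean((f0_conv[both] - f0_tgt[both]) ** 2)))
    corr = float(np.corrcoef(f0_conv[both], f0_tgt[both])[0, 1])
    return rmse, vuv_error, corr


def ddur(conv_durations, tgt_durations):
    """Average absolute duration difference between paired utterances."""
    conv = np.asarray(conv_durations, dtype=float)
    tgt = np.asarray(tgt_durations, dtype=float)
    return float(np.mean(np.abs(conv - tgt)))


if __name__ == "__main__":
    # Toy inputs, made up purely to exercise the functions.
    conv = np.zeros((2, 3))
    tgt = np.full((2, 3), 0.1)
    print("MCD (dB):", mel_cepstral_distortion(conv, tgt))
    print("F0 RMSE, VUV, CORR:",
          f0_metrics([0, 100, 110, 120, 0], [0, 100, 120, 140, 130]))
    print("DDUR:", ddur([1.0, 2.0], [1.5, 1.5]))
```

Restricting F0 RMSE and F0 CORR to frames voiced in both sequences is one common convention; the quoted passage does not say which convention the cited paper used.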

“…The forms of the acoustic models for VC included joint density Gaussian mixture models (JD-GMMs) [3], [7], [8], deep neural networks (DNNs) [9]-[11], recurrent neural networks (RNNs) [12], [13], and so on. Recently, sequence-to-sequence (seq2seq) neural networks [14]-[17] have also been applied to VC, which achieved higher naturalness and similarity than conventional frame-aligned conversion [18]-[20].…”

Section: Introduction (mentioning)

confidence: 99%

“…Second, the ASR model is usually trained with a phoneme classification loss and lacks explicit consideration on disentangling linguistic and speaker representations. Third, most of these methods follow the framework of frame-by-frame conversion and cannot achieve the advantages of seq2seq modeling [18], such as duration modification.…”

Section: Introduction (mentioning)

confidence: 99%

Non-Parallel Sequence-to-Sequence Voice Conversion With Disentangled Linguistic and Speaker Representations

Zhang, Ling, Dai. 2020. IEEE/ACM Trans. Audio Speech Lang. Process. Self-citation.

This paper presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion is achieved by preserving the linguistic representations of source utterances while replacing the speaker representations with the target ones. Our model is built under the framework of encoder-decoder neural networks. A recognition encoder is designed to learn the disentangled linguistic representations with two strategies. First, phoneme transcriptions of training data are introduced to provide the references for learning linguistic representations of audio signals. Second, an adversarial training strategy is employed to further wipe out speaker information from the linguistic representations. Meanwhile, speaker representations are extracted from audio signals by a speaker encoder. The model parameters are estimated by two-stage training, including a pretraining stage using a multi-speaker dataset and a fine-tuning stage using the dataset of a specific conversion pair. Since both the recognition encoder and the decoder for recovering acoustic features are seq2seq neural networks, there are no constraints of frame alignment and frame-by-frame conversion in our proposed method. Experimental results showed that our method obtained higher similarity and naturalness than the best non-parallel voice conversion method in Voice Conversion Challenge 2018. Besides, the performance of our proposed method was close to the state-of-the-art parallel seq2seq voice conversion method.

“…In contrast to S2ST, the input-output alignment for voice conversion is simpler and approximately monotonic. [23] also trains models that are specific to each input-output speaker pair (i.e. one-to-one conversion), whereas we explore many-to-one and many-to-many speaker configurations.…”

Section: Introduction (mentioning)

confidence: 99%

Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model

Jia, Weiss, Biadsy, et al. 2019. Interspeech 2019.

We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task.

“…Note that here the output of our model is still of the same length as the input. Although sequence to sequence based models, which can generate output sequences of variable length, have been successfully applied to VC [20,21,22,23,24], we will show that only considering temporal dependencies can bring significant improvements to VAE-VC.…”

Section: Modeling Time Dependencies With the FCN Structure (mentioning)

confidence: 93%

Investigation of F0 Conditioning and Fully Convolutional Networks in Variational Autoencoder Based Voice Conversion

Huang et al. 2019. Interspeech 2019.

In this work, we investigate the effectiveness of two techniques for improving variational autoencoder (VAE) based voice conversion (VC). First, we reconsider the relationship between vocoder features extracted using the high quality vocoders adopted in conventional VC systems, and hypothesize that the spectral features are in fact F0 dependent. Such hypothesis implies that during the conversion phase, the latent codes and the converted features in VAE based VC are in fact source F0 dependent. To this end, we propose to utilize the F0 as an additional input of the decoder. The model can learn to disentangle the latent code from the F0 and thus generates converted F0 dependent converted features. Second, to better capture temporal dependencies of the spectral features and the F0 pattern, we replace the frame wise conversion structure in the original VAE based VC framework with a fully convolutional network structure. Our experiments demonstrate that the degree of disentanglement as well as the naturalness of the converted speech are indeed improved.
