ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683282
ATTS2S-VC: Sequence-to-sequence Voice Conversion with Attention and Context Preservation Mechanisms

Abstract: This paper describes a method based on sequence-to-sequence (Seq2Seq) learning with attention and a context preservation mechanism for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling, such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerates the training procedure by considering guided attention and proposed context preservation losses, 2) allows not on…
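The abstract mentions a guided attention loss used to stabilize Seq2Seq training. The paper's exact formulation is not shown on this page; the sketch below follows the common guided-attention loss (a soft diagonal penalty on the attention matrix), with hypothetical function names and the usual width parameter g = 0.2 as assumptions.

```python
import numpy as np

def guided_attention_weight(N, T, g=0.2):
    # W[n, t] grows as attention strays from the diagonal n/N ≈ t/T.
    n = np.arange(N)[:, None] / N
    t = np.arange(T)[None, :] / T
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

def guided_attention_loss(A, g=0.2):
    # A: attention matrix of shape (N, T); rows are attention distributions.
    # The loss is the mean attention mass placed far from the diagonal.
    N, T = A.shape
    return float(np.mean(A * guided_attention_weight(N, T, g)))
```

A monotonic (near-diagonal) alignment incurs almost no penalty, while diffuse attention is penalized, which is what encourages fast, stable alignment learning.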

Cited by 111 publications (97 citation statements); references 51 publications.
“…Note that here the output of our model is still of the same length as the input. Although sequence-to-sequence based models, which can generate output sequences of variable length, have been successfully applied to VC [20,21,22,23,24], we will show that considering only temporal dependencies can bring significant improvements to VAE-VC.…”
Section: Modeling Time Dependencies with the FCN Structure
confidence: 93%
“…In this paper we propose an end-to-end architecture that directly generates the target signal, synthesizing it from scratch. It is most similar to recent work on sequence-to-sequence voice conversion [16][17][18]. [16] uses a similar end-to-end model, conditioned on speaker identities, to transform word segments from multiple speakers into multiple target voices.…”
Section: Introduction
confidence: 98%
“…However, we do find it helpful to multitask train the model to predict source speech phonemes. Finally, in contrast to [18], we train the model without auxiliary alignment or auto-encoding losses.…”
Section: Introduction
confidence: 99%
“…For both parallel and non-parallel VC, the systems usually change the voice but are unable to change the duration of the utterance. Much recent research has focused on converting speaking rate along with voice by using sequence-to-sequence models [11,16,17,18], as speaking rate is also a speaker characteristic.…”
Section: Introduction
confidence: 99%