Speaker Generation

Stanton, Daisy; Shannon, Matt; Mariooryad, Soroosh; Skerry-Ryan, RJ; Battenberg, Eric; Bagby, Tom; Kao, David T. H.

doi:10.1109/icassp43922.2022.9747345

Cited by 16 publications

(10 citation statements)

References 13 publications


“…As is often shown in the literature [3], gender is one of the dominant sources of variation in speech. The strength of each dimension's association with gender in the embedded space can be found using their correlation ratio, η, calculated by dividing the weighted variance of the mean of each category (male/female) by the variance of all samples.…”

Section: Exploring the Speaker Embeddings Space (mentioning)

confidence: 85%
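The correlation ratio described in the statement above can be sketched in a few lines. This is a minimal illustration, not the citing authors' code: the function names and the toy data are hypothetical. Note that the ratio the quote describes (between-category variance over total variance) is conventionally written η², so the sketch returns that ratio and leaves any square root to the caller.

```python
from statistics import fmean


def correlation_ratio(values, labels):
    """Between-category variance divided by total variance (eta squared).

    values: one embedding dimension, one float per utterance or speaker.
    labels: the category of each value (for example "f" / "m").
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")

    overall_mean = fmean(values)
    total = sum((v - overall_mean) ** 2 for v in values)
    if total == 0.0:
        return 0.0  # a constant dimension carries no association

    # Group the values by category.
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)

    # Weight each category's squared mean offset by its sample count.
    between = sum(
        len(g) * (fmean(g) - overall_mean) ** 2 for g in groups.values()
    )
    return between / total


def per_dimension_ratio(embeddings, labels):
    """Apply correlation_ratio to every column of a list of embedding rows."""
    return [correlation_ratio(list(col), labels) for col in zip(*embeddings)]


# Hypothetical toy data: dimension 0 separates the two categories, dimension 1 does not.
toy_embeddings = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
toy_labels = ["f", "f", "m", "m"]
ratios = per_dimension_ratio(toy_embeddings, toy_labels)
```

A value near 1 means the dimension is almost fully explained by the category, and a value near 0 means the category means coincide on that dimension.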


Generating Gender-Ambiguous Text-to-Speech Voices

Markopoulos, Georgia, Vamvoukakis, et al. 2022. Preprint.


The gender of a voice assistant or any voice user interface is a central element of its perceived identity. While a female voice is a common choice, there is an increasing interest in alternative approaches where the gender is ambiguous rather than clearly identifying as female or male. This work addresses the task of generating gender-ambiguous text-to-speech (TTS) voices that do not correspond to any existing person. This is accomplished by sampling from a latent speaker embeddings' space that was formed while training a multilingual, multispeaker TTS system on data from multiple male and female speakers. Various options are investigated regarding the sampling process. In our experiments, the effects of different sampling choices on the gender ambiguity and the naturalness of the resulting voices are evaluated. The proposed method is shown able to efficiently generate novel speakers that are superior to a baseline averaged speaker embedding. To our knowledge, this is the first systematic approach that can reliably generate a range of gender-ambiguous voices to meet diverse user requirements.


“…The speaker generation task has been introduced very recently by Stanton et al [3]. In their work, they train a multi-speaker Tacotron model by using learnable speaker embeddings and create a speaker embedding prior to model the distribution over the speaker embedding space.…”

Section: Related Work (mentioning)

confidence: 99%
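The idea of a prior over the speaker embedding space can be illustrated with a deliberately simplified sketch. It assumes the learned speaker embeddings are already available as plain lists of floats and fits a single diagonal Gaussian to them; that choice of distribution, the function names, and the toy values are assumptions made for brevity and are not the form of the prior used in the cited work.

```python
import random
from statistics import fmean, pstdev


def fit_diagonal_gaussian(embeddings):
    """Estimate a per-dimension mean and standard deviation.

    embeddings: list of equal-length lists, one row per training speaker.
    """
    columns = list(zip(*embeddings))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds


def sample_speaker(means, stds, rng):
    """Draw one new embedding, sampling each dimension independently."""
    return [rng.gauss(m, s) for m, s in zip(means, stds)]


# Hypothetical toy embeddings for two training speakers.
train_embeddings = [[0.0, 1.0], [2.0, 1.0]]
means, stds = fit_diagonal_gaussian(train_embeddings)

rng = random.Random(0)  # seeded so repeated runs draw the same speaker
new_speaker = sample_speaker(means, stds, rng)
```

The sampled vector would then be fed to the synthesizer in place of a training speaker's embedding, which is what makes the resulting voice correspond to no existing person.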

Generating Gender-Ambiguous Text-to-Speech Voices

Markopoulos, Georgia, Vamvoukakis, et al. 2022. Preprint.


“…We adapt the metric s2t-same from [36] which measures how similar synthesized audio from a synthesized speaker is to ground truth audio from the same speaker. While in the original context, this metric was used for speakers of the training dataset, here we use it to measure the speaker fidelity for unseen speakers.…”

Section: Objective Evaluation (mentioning)

confidence: 99%

Self supervised learning for robust voice cloning

Klapsas, Ellinas, Nikitaras, et al. 2022. Preprint.


Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to aid the resulting features to capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without utilizing additional speaker features. This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice. Subjective and objective evaluations are used to validate the proposed model, as well as the robustness to the acoustic conditions of the target utterance.


“…Inspired by the recently introduced task of speaker generation [62], we introduce a methodology to use VC models for speaker anonymization without the need for specifying a target speaker. Although we discuss this approach in the context of LVC-VC, it can feasibly be used to extend the capabilities of any VC model that incorporates a speaker encoder.…”

Section: E Extension: Un-targeted Speaker Anonymization (mentioning)

confidence: 99%

End-to-End Zero-Shot Voice Style Transfer with Location-Variable Convolutions

Kang, Hasegawa-Johnson, Roy. 2022. Preprint.


Zero-shot voice conversion is becoming an increasingly popular research direction, as it promises the ability to transform speech to match the voice style of any speaker. However, little work has been done on end-to-end methods for this task, which are appealing because they remove the need for a separate vocoder to generate audio from intermediate features. In this work, we propose Location-Variable Convolution-based Voice Conversion (LVC-VC), a model for performing end-to-end zero-shot voice conversion that is based on a neural vocoder. LVC-VC utilizes carefully designed input features that have disentangled content and speaker style information, and the vocoder-like architecture learns to combine them to simultaneously perform voice conversion while synthesizing audio. To the best of our knowledge, LVC-VC is one of the first models to be proposed that can perform zero-shot voice conversion in an end-to-end manner, and it is the first to do so using a vocoder-like neural framework. Experiments show that our model achieves competitive or better voice style transfer performance compared to several baselines while maintaining the intelligibility of transformed speech much better.

