A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion

Niekerk, Benjamin van; Carbonneau, Marc‐André; Zaïdi, Julian; Seuté, Hugo; Kamper, Herman

doi:10.1109/icassp43922.2022.9746484

Cited by 41 publications

(11 citation statements)

References 42 publications


“…Initially, SSL models were primarily used in speech recognition [6], [7]. Subsequently, similar approach was successfully extended to another speech processing tasks such as language, emotion, speaker recognition [20], [21] and VC [22]. While the real-time scenarios are highly important use cases for ASR models, the streaming scenario is often challenging for such models since SSL pre-training procedure is performed on full-length files without streaming mode adaptation.…”

Section: B. SSL Models (mentioning)

confidence: 99%

Streaming ASR Encoder for Whisper-to-Speech Online Voice Conversion

Avdeeva, Gusev, Andzhukaev et al. 2024

IEEE Open J. Signal Process.


Whispered speech is a quiet voice without vocalization. One of the common cases of using whispered speech is a technique that can help overcome stuttering. But whispered speech can be uncomfortable and difficult to understand in everyday communication. To address these problems, we propose a method of low-delayed whisper-to-speech voice conversion, which can be useful in real life communication of people with disordered speech. As part of our research, we study the impact of streaming Automatic Speech Recognition models on the quality of voice conversion, comparing different streaming models and methods for model adaptation to streaming settings, and showing the importance of using such models in cases of low-delayed voice conversion.

INDEX TERMS: Speech recognition, voice conversion, disordered speech, whisper-to-speech processing.

I. INTRODUCTION

Despite the huge progress in developing speech processing tools for various types of disordered speech there is still room for improvement. In this research we concentrate on the stuttering problem. Based on our literature review there are only several works regarding the stuttering problem. These studies discover different aspects such as detecting stuttering type [1], recognizing and even synthesizing [2] stuttering speech. But in this investigation the focus is on a solution which can partially help to control stuttering. According to [3], one of the techniques which can help to overcome stuttering is whispered speech. But whispered speech lacks naturalness due to absence of the fundamental frequency (F0). Thus, we aim to create a system capable of transforming whispers into regular speech and apply this method to real-time processing.

The majority of novel voice conversion (VC) systems adopt the following scheme. The whole system usually consists of three parts: an Automatic Speech Recognition (ASR) encoder for phonetic posteriorgrams (PPGs) extraction, a decoder taking PPGs features as input to predict mel spectrograms of target audio and a vocoder for synthesizing audio. The acoustic-phonetic distinctions between whispered and regular speech lead to substantial degradation of ASR systems [4]. However, according to [5] a small set of whispered or pseudo-whispered data used for adaptation brings significant improvements in ASR systems quality. Thus, the model trained on a large amount of speech can be easily adapted to the whispered domain. Also, the recent breakthrough in self-supervised learning (SSL) allows to obtain well-performing ASR models having only a few hours of labeled data [6]. But, unfortunately, the design of SSL training makes streaming mode challenging for such models.

This paper proposes the following contributions:

• We demonstrate the ability of HuBERT [7] model pretrained with SSL to work in a streaming mode after an attention context masking or chunk-wise fine-tune training procedure.

• We show the importance of using a streaming encoder model to improve the quality of low latency whisper-to-speech VC.

• We propose an online VC system adapted to the whi...
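The abstract above mentions "attention context masking" as one way to adapt an SSL encoder to streaming. The excerpt does not say how the authors build their mask, so the following is only a minimal sketch of one common form of the idea, under assumed parameters: frames are grouped into fixed-size chunks, and each frame may attend to its own chunk plus a limited number of preceding chunks. The function name `chunk_attention_mask` and the parameters `chunk_size` and `left_chunks` are hypothetical and do not come from the paper.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks):
    """Build a boolean attention mask for chunk-wise streaming.

    mask[q][k] is True when query frame q is allowed to attend to key frame k.
    Assumption (not from the paper): a frame sees every frame in its own chunk
    and in up to `left_chunks` chunks before it, and nothing from later chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if left_chunks < 0:
        raise ValueError("left_chunks must be non-negative")

    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size                   # chunk index of the query frame
        first_chunk = max(0, q_chunk - left_chunks)  # oldest chunk still visible
        row = [first_chunk <= (k // chunk_size) <= q_chunk
               for k in range(num_frames)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Six frames, chunks of two frames, one chunk of left context.
    for row in chunk_attention_mask(6, 2, 1):
        print("".join("x" if allowed else "." for allowed in row))
```

With such a mask, the encoder never depends on frames from future chunks, which is the property a low-latency setting needs; the chunk size then sets the trade-off between latency and available context.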


“…Textless-NLP [34,35] and AudioLM [5] do not use text transcriptions or phoneme symbols in speech processing systems; they use discrete units constructed by self-supervised learning. Soft discrete unit is another approach for textless speech processing [57].…”

Section: Textless-NLP (mentioning)

confidence: 99%

WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Rekimoto 2023

Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems



“…• Voice conversion: We measure conversion intelligibility following [46], [47], whereby we perform voice conversion and then apply a speech recognition system to the output and compute a character error rate (CER) and F1 classification score to the word spoken in the original utterance. Speaker similarity is measured as described in [46] whereby we find similarity scores between real/generated utterance pairs using a trained speaker classifier, and then compute an EER with real/generated scores assigned a label of 0 and real/real pair scores assigned a label of 1. • Speech enhancement: Given a series of original clean and noisy utterances, and the models' denoised output, we compute standard measures of denoising performance: narrow-band perceptual evaluation of speech quality (PESQ) [48] and short term objective intelligibility (STOI) scores [49].…”

Section: Experimental Setup: Unseen Tasks, A. Evaluation Metrics (mentioning)

confidence: 99%
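The citation statement above describes two objective measures: a character error rate (CER) computed on speech-recognition output, and an equal error rate (EER) computed over similarity scores, with real/generated pairs labelled 0 and real/real pairs labelled 1. The sketch below illustrates both with generic textbook definitions in plain Python. It is not the citing authors' evaluation code; the function names and the simple threshold sweep are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: strings of characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def equal_error_rate(scores, labels):
    """Approximate EER with a threshold sweep over the observed scores.

    labels: 1 for real/real pairs, 0 for real/generated pairs (as in the quote).
    A pair is accepted as "same speaker" when its score is >= the threshold.
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one pair of each label")

    # Include one threshold above every score so "reject everything" is covered.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    best_gap, eer = None, None
    for t in thresholds:
        far = sum(s >= t for s in neg) / len(neg)  # label-0 pairs accepted
        frr = sum(s < t for s in pos) / len(pos)   # label-1 pairs rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    print(cer("seven", "sevn"))
    print(equal_error_rate([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
```

Under this labelling, an EER near 0 means the similarity scores separate real/generated pairs from real/real pairs cleanly, while an EER near 0.5 means the two groups of scores are hard to tell apart.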

GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models

Kamper 2023

2022 IEEE Spoken Language Technology Workshop (SLT)


Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN), a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation which probabilistically skips discriminator updates. We apply it on the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis. It is also substantially faster than existing top-performing diffusion models. We confirm that ASGAN's latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training. Specifically, we perform evaluations in voice conversion, speech enhancement, speaker verification, and keyword classification. Our work indicates that GANs are still highly competitive in the unconditional speech synthesis landscape, and that disentangled latent spaces can be used to aid generalization to unseen tasks. Code, models, samples: https://github.com/RF5/simple-asgan/.
