Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder

Fang, Huajian; Carbajal, Guillaume; Wermter, Stefan; Gerkmann, Timo

doi:10.1109/icassp39728.2021.9414060

Cited by 41 publications

(17 citation statements)

References 19 publications

“…The proposed model could also be applied to pitch-informed speech enhancement. Indeed, several recent weakly-supervised speech enhancement methods consist in estimating the VAE latent representation of a clean speech signal given a noisy speech signal (Bando et al., 2018; Leglaive et al., 2018; Sekiguchi et al., 2018; Leglaive et al., 2019b,a; Pariente et al., 2019; Leglaive et al., 2020; Richter et al., 2020; Carbajal et al., 2021; Fang et al., 2021). Using the proposed conditional deep generative speech model, this estimation could be constrained given the f0 contour computed with a robust f0 estimation algorithm such as CREPE.…”

Section: Discussion (mentioning)

confidence: 99%

Learning and controlling the source-filter representation of speech with a variational autoencoder

Sadok, Leglaive, Girin et al. 2022, Preprint

Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency f0 and the formants are of primary importance. In this work, we show that the source-filter model of speech production naturally arises in the latent space of a variational autoencoder (VAE) trained in an unsupervised manner on a dataset of natural speech signals. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we experimentally illustrate that f0 and the formant frequencies are encoded in orthogonal subspaces of the VAE latent space, and we develop a weakly-supervised method to accurately and independently control these speech factors of variation within the learned latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on f0 and the formant frequencies, and which is applied to the transformation of speech signals.

“…Thus, they are trained solely to generate clean speech and are therefore considered more robust to different acoustic environments compared to their discriminative counterparts. In fact, generative approaches have shown to perform better under mismatched training and test conditions [8,11,12,13]. However, they are currently less studied and still lag behind discriminative approaches, which is a strong incentive to conduct more research to realize their full potential.…”

Section: Forward Process (mentioning)

confidence: 99%

“…Instead of learning a direct mapping from noisy to clean speech, generative models aim to learn the distribution of clean speech as a prior for speech enhancement. Several approaches have utilized deep generative models for speech enhancement using generative adversarial networks (GANs) [4], variational autoencoders (VAEs) [5,6,7,8], flow-based models [9], and more recently denoising diffusion probabilistic models (DDPMs) [10,11]. The main principle of these approaches is to learn the inherent properties of clean speech, …”

Section: Introduction (mentioning)

confidence: 99%

Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain

Simon, Richter, Gerkmann 2022, Preprint (self-citation)

Score-based generative models (SGMs) have recently shown impressive results for difficult generative tasks such as the unconditional and conditional generation of natural images and audio signals. In this work, we extend these models to the complex short-time Fourier transform (STFT) domain, proposing a novel training task for speech enhancement using a complex-valued deep neural network. We derive this training task within the formalism of stochastic differential equations, thereby enabling the use of predictor-corrector samplers. We provide alternative formulations inspired by previous publications on using SGMs for speech enhancement, avoiding the need for any prior assumptions on the noise distribution and making the training task purely generative which, as we show, results in improved enhancement performance.

“…Nevertheless, neural networks can be used in medicine in many ways. Areas of application include the detection of abnormalities in diagnostic imaging [34] and the filtering of interfering and background noise in hearing aids [12]. Current projects address the detection of vessels in cross-sectional imaging without contrast agents, which could help avoid contrast-agent-associated complications in this standard imaging procedure.…”

Section: Neuronale (unclassified)

Privatsphärefreundliches maschinelles Lernen

Stock, Petersen, Behrendt et al. 2022, Informatik Spektrum

Abstract: Machine learning methods have for some years been applied in an ever-growing number of areas, which underscores the relevance of the techniques involved. The term machine learning (ML, often also called "artificial intelligence") covers numerous algorithms of differing complexity and with different properties. Training these algorithms usually requires large amounts of data. Particularly when personal data is used, questions arise about data protection and the privacy of the people affected. This is the first part of a two-part article on privacy-friendly ML. This first part offers an easily accessible introduction to ML and covers the most important basic terms. It also presents some of the most widely used ML methods, such as decision trees and neural networks. The second part, which will appear in the next issue of Informatik Spektrum, covers privacy attacks and privacy-enhancing measures in the context of ML.
