ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414363
Guided Variational Autoencoder for Speech Enhancement with a Supervised Classifier

Abstract: Recently, variational autoencoders have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. However, variational autoencoders are trained on clean speech only, which results in a limited ability to extract the speech signal from noisy speech compared to supervised approaches. In this paper, we propose to guide the variational autoencoder with a supervised classifier separately trained on noisy speech. The estimated label is a high-leve…

Cited by 16 publications (11 citation statements) | References 19 publications (26 reference statements)
“…The VAE can be conditioned on a label y_n ∈ Y describing a speech attribute (e.g. speech activity), which allows for more explicit control of speech generation [13]. A common approach is to make use of the label y_n by directly inputting it into both the encoder E_{φ,z}(|s_n|², y_n) and the decoder D_θ(z_n, y_n) (see Fig. 1b) [11,12,13].…”
Section: Conditional VAE
confidence: 99%
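The conditioning scheme quoted above — feeding the label y_n into both the encoder and the decoder alongside the spectral input — can be sketched with a toy linear model. All dimensions, weight names, and the linear maps themselves are illustrative stand-ins, not the cited papers' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(y, num_labels):
    """Encode an integer label as a one-hot vector."""
    v = np.zeros(num_labels)
    v[y] = 1.0
    return v

class ConditionalVAE:
    """Toy linear conditional VAE: the label y_n is concatenated to the
    input of both the encoder and the decoder (hypothetical shapes)."""
    def __init__(self, spec_dim, latent_dim, num_labels):
        self.num_labels = num_labels
        # Encoder maps [|s_n|^2 ; y_n] -> (mean, log-variance) of z_n.
        self.W_mu = rng.standard_normal((latent_dim, spec_dim + num_labels)) * 0.1
        self.W_logvar = rng.standard_normal((latent_dim, spec_dim + num_labels)) * 0.1
        # Decoder maps [z_n ; y_n] -> a positive speech power spectrum.
        self.W_dec = rng.standard_normal((spec_dim, latent_dim + num_labels)) * 0.1

    def encode(self, power_spec, y):
        x = np.concatenate([power_spec, one_hot(y, self.num_labels)])
        return self.W_mu @ x, self.W_logvar @ x

    def decode(self, z, y):
        x = np.concatenate([z, one_hot(y, self.num_labels)])
        return np.exp(self.W_dec @ x)  # exp keeps the output positive

vae = ConditionalVAE(spec_dim=8, latent_dim=4, num_labels=2)
mu, logvar = vae.encode(rng.random(8), y=1)
z = mu + np.exp(0.5 * logvar) * rng.standard_normal(4)  # reparameterization trick
recon = vae.decode(z, y=1)
```

Concatenation is only the simplest way to inject the label; the key point from the quote is that the same y_n reaches both networks, so the latent z_n is free to model the remaining speech variability.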
“…tion [10]. For various speech-related tasks, VAEs have been conditioned on a label describing a speech attribute, such as speaker identity [11,12], phoneme [12], or speech activity [13]. Ideally, the label should be independent of the other latent dimensions so as to obtain explicit control of speech generation.…”
Section: Introduction
confidence: 99%
“…This approach is computationally efficient since it does not require sampling or gradient descent at each step of the algorithm. More recently, a guided VAE was proposed in [36], where the VAE-based clean speech prior is defined conditionally on voice activity detection or the ideal binary mask. This guiding information has to be provided by a supervised classifier, separately trained on noisy speech signals.…”
Section: Related Work
confidence: 99%
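The guiding pipeline described in this quote — a classifier estimates a label from the noisy input, and that label conditions the clean-speech prior — can be illustrated minimally. The energy-threshold "classifier" and the label-selected decoder weights below are purely hypothetical stand-ins for the supervised classifier and conditional VAE of [36]:

```python
import numpy as np

def toy_vad_classifier(noisy_frame, threshold=1.0):
    """Stand-in for a supervised classifier trained on noisy speech:
    predicts a binary voice-activity label from frame energy."""
    return int(np.sum(noisy_frame ** 2) > threshold)

def guided_prior_variance(z, y, W_speech, W_silence):
    """Hypothetical label-conditioned clean-speech prior: the predicted
    label selects which decoder weights generate the speech variance."""
    W = W_speech if y == 1 else W_silence
    return np.exp(W @ z)  # positive variance per frequency bin

rng = np.random.default_rng(1)
W_speech = rng.standard_normal((8, 4)) * 0.1
W_silence = rng.standard_normal((8, 4)) * 0.1

noisy = rng.standard_normal(8)
y_hat = toy_vad_classifier(noisy)  # label estimated from the noisy input only
var_s = guided_prior_variance(rng.standard_normal(4), y_hat, W_speech, W_silence)
```

The point of the guidance is that the label comes from a model trained on noisy speech, compensating for the clean-speech-only training of the VAE prior itself.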
“…where Σ_{θs,t} = Σ_{θs,t}(s_{1:t−1}, z_{1:t}) and z_{1:t} is sampled from p_θ(z_{1:t}|x_{1:T}). In practice, this posterior distribution is also intractable, so we propose a variational approximation q_φ(z_{1:T}|x_{1:T}) whose parameters φ need to be estimated jointly with the noisy mixture model parameters ϕ, in order to compute the speech estimate in (36). As detailed in the next section, we propose a VEM algorithm to do so.…”
Section: B. Speech Reconstruction
confidence: 99%
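The expectation-maximization idea behind the VEM algorithm quoted above — alternate between fitting a posterior over the latent clean speech (E-step) and updating the mixture-model parameters (M-step) — can be shown on a toy scalar signal-plus-noise model where the E-step is available in closed form. The model and every variable name here are illustrative, not the paper's:

```python
import numpy as np

# Toy generative model: x_t = s_t + n_t with unknown variances v_s, v_n.
rng = np.random.default_rng(2)
s = rng.normal(0.0, np.sqrt(2.0), 5000)   # latent "speech"
x = s + rng.normal(0.0, np.sqrt(0.5), 5000)  # observed "noisy mixture"

v_s, v_n = 1.0, 1.0  # initial parameter guesses
for _ in range(50):
    # E-step: the Gaussian posterior q(s_t | x_t) is closed form here.
    gain = v_s / (v_s + v_n)        # Wiener-style posterior mean gain
    post_mean = gain * x
    post_var = gain * v_n
    # M-step: re-estimate variances from the posterior moments.
    v_s = np.mean(post_mean ** 2 + post_var)
    v_n = np.mean((x - post_mean) ** 2 + post_var)

# Final speech estimate: the posterior mean under the fitted parameters.
s_hat = (v_s / (v_s + v_n)) * x
```

In the paper's setting the posterior over z_{1:T} is not closed form, which is why a variational approximation q_φ replaces the exact E-step; the alternating structure stays the same.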