ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414363
Guided Variational Autoencoder for Speech Enhancement with a Supervised Classifier

Abstract: Recently, variational autoencoders have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. However, variational autoencoders are trained on clean speech only, which results in a limited ability to extract the speech signal from noisy speech compared to supervised approaches. In this paper, we propose to guide the variational autoencoder with a supervised classifier separately trained on noisy speech. The estimated label is a high-leve…

Cited by 16 publications (11 citation statements) | References 19 publications (26 reference statements)
“…The VAE can be conditioned on a label y_n ∈ Y describing a speech attribute (e.g. speech activity), which allows for more explicit control of speech generation [13]. A common approach is to make use of the label y_n by directly inputting it into both the encoder E_{φ,z}(|s_n|², y_n) and the decoder D_θ(z_n, y_n) (see Fig. 1b) [11,12,13].…”
Section: Conditional VAE
confidence: 99%
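The conditioning scheme quoted above — feeding the label y_n into both the encoder and the decoder alongside the spectral input — can be sketched with a toy linear model. All dimensions, weight names, and the linear maps themselves are illustrative stand-ins, not the cited papers' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(y, num_labels):
    """Encode an integer label as a one-hot vector."""
    v = np.zeros(num_labels)
    v[y] = 1.0
    return v

class ConditionalVAE:
    """Toy linear conditional VAE: the label y_n is concatenated to the
    input of both the encoder and the decoder (hypothetical shapes)."""
    def __init__(self, spec_dim, latent_dim, num_labels):
        self.num_labels = num_labels
        # Encoder maps [|s_n|^2 ; y_n] -> (mean, log-variance) of z_n.
        self.W_mu = rng.standard_normal((latent_dim, spec_dim + num_labels)) * 0.1
        self.W_logvar = rng.standard_normal((latent_dim, spec_dim + num_labels)) * 0.1
        # Decoder maps [z_n ; y_n] -> a positive speech power spectrum.
        self.W_dec = rng.standard_normal((spec_dim, latent_dim + num_labels)) * 0.1

    def encode(self, power_spec, y):
        x = np.concatenate([power_spec, one_hot(y, self.num_labels)])
        return self.W_mu @ x, self.W_logvar @ x

    def decode(self, z, y):
        x = np.concatenate([z, one_hot(y, self.num_labels)])
        return np.exp(self.W_dec @ x)  # exp keeps the output positive

vae = ConditionalVAE(spec_dim=8, latent_dim=4, num_labels=2)
mu, logvar = vae.encode(rng.random(8), y=1)
z = mu + np.exp(0.5 * logvar) * rng.standard_normal(4)  # reparameterization trick
recon = vae.decode(z, y=1)
```

Concatenation is only the simplest way to inject the label; the key point from the quote is that the same y_n reaches both networks, so the latent z_n is free to model the remaining speech variability.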
“…tion [10]. For various speech-related tasks, VAEs have been conditioned on a label describing a speech attribute, such as speaker identity [11,12], phoneme [12], or speech activity [13]. Ideally, the label should be independent of the other latent dimensions so as to obtain explicit control of speech generation.…”
Section: Introduction
confidence: 99%
“…This approach is computationally efficient since it does not require sampling or gradient descent at each step of the algorithm. More recently, a guided VAE was proposed in [36], where the VAE-based clean speech prior is defined conditionally on voice activity detection or the ideal binary mask. This guiding information has to be provided by a supervised classifier, separately trained on noisy speech signals.…”
Section: Related Work
confidence: 99%
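The guiding pipeline described in this quote — a classifier estimates a label from the noisy input, and that label conditions the clean-speech prior — can be illustrated minimally. The energy-threshold "classifier" and the label-selected decoder weights below are purely hypothetical stand-ins for the supervised classifier and conditional VAE of [36]:

```python
import numpy as np

def toy_vad_classifier(noisy_frame, threshold=1.0):
    """Stand-in for a supervised classifier trained on noisy speech:
    predicts a binary voice-activity label from frame energy."""
    return int(np.sum(noisy_frame ** 2) > threshold)

def guided_prior_variance(z, y, W_speech, W_silence):
    """Hypothetical label-conditioned clean-speech prior: the predicted
    label selects which decoder weights generate the speech variance."""
    W = W_speech if y == 1 else W_silence
    return np.exp(W @ z)  # positive variance per frequency bin

rng = np.random.default_rng(1)
W_speech = rng.standard_normal((8, 4)) * 0.1
W_silence = rng.standard_normal((8, 4)) * 0.1

noisy = rng.standard_normal(8)
y_hat = toy_vad_classifier(noisy)  # label estimated from the noisy input only
var_s = guided_prior_variance(rng.standard_normal(4), y_hat, W_speech, W_silence)
```

The point of the guidance is that the label comes from a model trained on noisy speech, compensating for the clean-speech-only training of the VAE prior itself.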
“…where Σ_{θs,t} = Σ_{θs,t}(s_{1:t−1}, z_{1:t}) and z_{1:t} is sampled from p_θ(z_{1:t}|x_{1:T}). In practice, this posterior distribution is also intractable, so we propose a variational approximation q_φ(z_{1:T}|x_{1:T}) whose parameters φ need to be estimated jointly with the noisy mixture model parameters ϕ, in order to compute the speech estimate in (36). As detailed in the next section, we propose a VEM algorithm to do so.…”
Section: B. Speech Reconstruction
confidence: 99%
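The expectation-maximization idea behind the VEM algorithm quoted above — alternate between fitting a posterior over the latent clean speech (E-step) and updating the mixture-model parameters (M-step) — can be shown on a toy scalar signal-plus-noise model where the E-step is available in closed form. The model and every variable name here are illustrative, not the paper's:

```python
import numpy as np

# Toy generative model: x_t = s_t + n_t with unknown variances v_s, v_n.
rng = np.random.default_rng(2)
s = rng.normal(0.0, np.sqrt(2.0), 5000)   # latent "speech"
x = s + rng.normal(0.0, np.sqrt(0.5), 5000)  # observed "noisy mixture"

v_s, v_n = 1.0, 1.0  # initial parameter guesses
for _ in range(50):
    # E-step: the Gaussian posterior q(s_t | x_t) is closed form here.
    gain = v_s / (v_s + v_n)        # Wiener-style posterior mean gain
    post_mean = gain * x
    post_var = gain * v_n
    # M-step: re-estimate variances from the posterior moments.
    v_s = np.mean(post_mean ** 2 + post_var)
    v_n = np.mean((x - post_mean) ** 2 + post_var)

# Final speech estimate: the posterior mean under the fitted parameters.
s_hat = (v_s / (v_s + v_n)) * x
```

In the paper's setting the posterior over z_{1:T} is not closed form, which is why a variational approximation q_φ replaces the exact E-step; the alternating structure stays the same.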