Perspectives

Virtanen, Tuomas; Gannot, Sharon

doi:10.1002/9781119279860.ch19

Cited by 6 publications

(7 citation statements)

References 57 publications


“…Most of this research concentrates on overcoming the fundamental limitation of NMF, namely the fact that it models spectro-temporal magnitude or power only, and enabling it to account for phase. For an in-depth discussion of this and other perspectives, see [65].…”

Section: Discussion (mentioning)

confidence: 99%

Single-Channel Audio Source Separation with NMF: Divergences, Constraints and Algorithms

Févotte

Vincent

Ozerov

2018

Audio Source Separation

Self Cite


Spectral decomposition by nonnegative matrix factorisation (NMF) has become state-of-the-art practice in many audio signal processing tasks, such as source separation, enhancement or transcription. This chapter reviews the fundamentals of NMF-based audio decomposition, in unsupervised and informed settings. We formulate NMF as an optimisation problem and discuss the choice of the measure of fit. We present the standard majorisation-minimisation strategy to address optimisation for NMF with common β-divergence, a family of measures of fit that takes the quadratic cost, the generalised Kullback-Leibler divergence and the Itakura-Saito divergence as special cases. We discuss the reconstruction of time-domain components from the spectral factorisation and present common variants of NMF-based spectral decomposition: supervised and informed settings, regularised versions, temporal models.
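As a rough illustration of the optimisation problem this abstract describes, the sketch below runs the widely used multiplicative updates for NMF under the β-divergence. It is not taken from the cited chapter: the function names (`beta_divergence`, `nmf_beta`), the random initialisation, the `eps` safeguard and the iteration count are all assumptions made for the example. The updates are written in their plain form, which is known to be non-increasing for β between 1 and 2; for other values of β the majorisation-minimisation derivation adds an exponent to the update ratio, which this sketch omits.

```python
import numpy as np


def beta_divergence(V, V_hat, beta):
    """Sum of element-wise beta-divergences between V and its model V_hat."""
    if beta == 0:  # Itakura-Saito divergence
        ratio = V / V_hat
        return float(np.sum(ratio - np.log(ratio) - 1.0))
    if beta == 1:  # generalised Kullback-Leibler divergence
        return float(np.sum(V * np.log(V / V_hat) - V + V_hat))
    # General case; beta == 2 gives half the squared Euclidean distance.
    num = V**beta + (beta - 1) * V_hat**beta - beta * V * V_hat**(beta - 1)
    return float(np.sum(num / (beta * (beta - 1))))


def nmf_beta(V, rank, beta=1.0, n_iter=100, eps=1e-12, seed=0):
    """Factorise a positive matrix V (F x N) as W @ H with W, H nonnegative.

    Hypothetical helper for illustration only. Returns W, H and the
    divergence recorded after each iteration.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    history = []
    for _ in range(n_iter):
        # Update the activations H with the dictionary W held fixed.
        V_hat = W @ H + eps
        H *= (W.T @ (V * V_hat**(beta - 2))) / (W.T @ V_hat**(beta - 1) + eps)
        # Update the dictionary W with the activations H held fixed.
        V_hat = W @ H + eps
        W *= ((V * V_hat**(beta - 2)) @ H.T) / (V_hat**(beta - 1) @ H.T + eps)
        history.append(beta_divergence(V, W @ H + eps, beta))
    return W, H, history


if __name__ == "__main__":
    # Synthetic stand-in for a magnitude spectrogram (strictly positive).
    rng = np.random.default_rng(1)
    V = rng.random((20, 5)) @ rng.random((5, 30)) + 1e-3
    W, H, history = nmf_beta(V, rank=5, beta=1.0, n_iter=50)
    print("first and last divergence:", history[0], history[-1])
```

In an audio setting V would typically be a magnitude or power spectrogram, the columns of W spectral templates and the rows of H their activations over time; the synthetic matrix above only stands in for such data.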



“…To that aim, we propose a probabilistic mixture of the audio and visual models. Due to the limited space, the details of the derivations are provided in a supplementary document available online¹.…”

Section: Inference With An Audio-visual Mixture (mentioning)

confidence: 99%

“…The recent impressive performance of deep neural networks (DNNs) in computer vision and machine learning has paved the way to revisit many important signal processing problems. One such problem is that of speech enhancement, i.e., the task of estimating a clean speech from its noisy observation [1], [2]. DNNs have been widely utilized for this task, where a neural network is trained to map a noisy speech spectrogram to its clean version, or to a time frequency (TF) mask [3]-[5].…”

Section: Introduction (mentioning)

confidence: 99%

Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

Sadeghi

Alameda-Pineda

2019

Preprint


In this paper, we are interested in unsupervised speech enhancement using latent variable generative models. We propose to learn a generative model for clean speech spectrogram based on a variational autoencoder (VAE) where a mixture of audio and visual networks is used to infer the posterior of the latent variables. This is motivated by the fact that visual data, i.e., lips images of the speaker, provide helpful and complementary information about speech. As such, they can help train a richer inference network. Moreover, during speech enhancement, visual data are used to initialize the latent variables, thus providing a more robust initialization than the noisy speech spectrogram. A variational inference approach is derived to train the proposed VAE. Thanks to the novel inference procedure and the robust initialization, the proposed audio-visual mixture VAE exhibits superior performance on speech enhancement than using the standard audio-only counterpart.


“…Speech enhancement, or how to estimate clean speech from a noisy signal, has attracted a lot of attention, both for single- and multi-channel audio recordings [1][2][3][4]. Recently, generative models have been utilized for speech enhancement [5][6][7][8][9][10].…”

Section: Introduction (mentioning)

confidence: 99%

Robust Unsupervised Audio-Visual Speech Enhancement Using a Mixture of Variational Autoencoders

Sadeghi

Alameda-Pineda

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Recently, an audio-visual speech generative model based on variational autoencoder (VAE) has been proposed, which is combined with a nonnegative matrix factorization (NMF) model for noise variance to perform unsupervised speech enhancement. When visual data is clean, speech enhancement with audio-visual VAE shows a better performance than with audio-only VAE, which is trained on audio-only data. However, audio-visual VAE is not robust against noisy visual data, e.g., when for some video frames, speaker face is not frontal or lips region is occluded. In this paper, we propose a robust unsupervised audio-visual speech enhancement method based on a per-frame VAE mixture model. This mixture model consists of a trained audio-only VAE and a trained audio-visual VAE. The motivation is to skip noisy visual frames by switching to the audio-only VAE model. We present a variational expectation-maximization method to estimate the parameters of the model. Experiments show the promising performance of the proposed method.

