Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

Sadeghi, Mostafa; Alameda-Pineda, Xavier

doi:10.48550/arxiv.1912.10647

Cited by 4 publications

(12 citation statements)

References 13 publications


“…Listening tests for speech intelligibility assessment:
DRT [255] (1983): audio-only listening test using rhyming words.
HINT [191] (1994): audio-only listening test using everyday sentences.
Matrix-like audio-visual test [178] (2019): matrix test using audio-visual stimuli. Papers: [178], [13].
Estimators of speech quality based on perceptual models:
PESQ [117], [119], [120], [214] (2001): designed to assess quality across a wide range of codecs and network conditions, mostly for telephony. Papers: [3], [5]-[7], [12], [17], [37], [55], [65], [66], [76], [77], [85], [99], [107], [108], [109], [122], [128], [136], [153], [154], [176], [178], [179], [183], [220]-[222], [239], [244], [263], [274], [279].
CSIG / CBAK / COVRL [104] (2007): composite measures which combine basic objective measures. Papers: [108].
HASQI [131], [133] (2010): specifically designed for hearing-impaired listeners. Papers: [99], [100].
POLQA [1...…”

Section: Type (mentioning)

confidence: 99%

Deep-learning-based audio-visual speech enhancement in presence of Lombard effect

Michelsanti, Sigurðsson, Jensen. 2019. Speech Communication.

Keywords: Lombard effect; audio-visual speech enhancement; deep learning; speech quality; speech intelligibility

Abstract: When speaking in the presence of background noise, humans reflexively change their way of speaking in order to improve the intelligibility of their speech. This reflex is known as the Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet to which noise is artificially added. Since these systems are often used in situations where Lombard speech occurs, in this work we perform an analysis of the impact that the Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field. We conduct several experiments using an audio-visual Lombard speech corpus consisting of utterances spoken by 54 different talkers. The results show that training deep-learning-based models with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility at low signal-to-noise ratios, where the visual modality can play an important role in acoustically challenging situations. We also find that a performance difference between genders exists due to the distinct Lombard speech exhibited by males and females, and we analyse it in relation to acoustic and visual features. Furthermore, listening tests conducted with audio-visual stimuli show that the speech quality of the signals processed with systems trained using Lombard speech is statistically significantly better than that obtained using systems trained with non-Lombard speech at a signal-to-noise ratio of −5 dB. Regarding speech intelligibility, we find a general tendency towards a benefit from training the systems with Lombard speech.



“…Both of the algorithms were run for 200 iterations, on the same test set. For optimizing (8), the Adam optimizer [18] was used with a learning rate of 0.05 for 10 iterations. Moreover, we used D = 20 samples to compute (6) and (10).…”

Section: Methods (mentioning)

confidence: 99%
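The inner optimization described in the statement above (Adam, learning rate 0.05, 10 iterations) can be sketched as follows. This is only an illustration under stated assumptions: objective (8) is not reproduced on this page, so a hypothetical quadratic `toy_loss` stands in for it, and the names `adam_minimize`, `toy_loss`, `toy_grad` and `TARGET` are invented for the example. The moment-decay constants are the commonly used Adam defaults, not values taken from the cited paper.

```python
import math


def adam_minimize(grad_fn, z0, lr=0.05, n_steps=10,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run a fixed number of Adam steps and return the updated parameters."""
    z = list(z0)
    m = [0.0] * len(z)  # running mean of gradients (first moment)
    v = [0.0] * len(z)  # running mean of squared gradients (second moment)
    for t in range(1, n_steps + 1):
        g = grad_fn(z)
        for i, g_i in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g_i
            v[i] = beta2 * v[i] + (1.0 - beta2) * g_i * g_i
            # Bias correction compensates for the zero initialization.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            z[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return z


# Hypothetical stand-in for objective (8): a simple quadratic bowl.
TARGET = [1.0, -2.0]


def toy_loss(z):
    return sum((a - b) ** 2 for a, b in zip(z, TARGET))


def toy_grad(z):
    return [2.0 * (a - b) for a, b in zip(z, TARGET)]


z_init = [0.0, 0.0]
z_final = adam_minimize(toy_grad, z_init)  # lr=0.05, 10 iterations
```

With so few iterations the parameters only move a short distance, which is consistent with using this step as an inner refinement inside an outer loop of 200 iterations rather than as a full optimization.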

“…where KL denotes the Kullback-Leibler divergence. In (8), the expectation over $r_m$ and $r_s$ can be evaluated in closed form. This is also the case for the KL term as both the distributions are Gaussian.…”

Section: E-z Step (mentioning)

confidence: 99%
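The remark that the KL term is available in closed form when both distributions are Gaussian can be illustrated with the standard formula for two Gaussians. The sketch below assumes real-valued Gaussians with diagonal covariances, which may differ from the exact parameterization in the cited paper; the function name `kl_diag_gaussians` is hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) in closed form.

    Per dimension: 0.5 * [ log(var_p / var_q)
                           + (var_q + (mu_q - mu_p)^2) / var_p - 1 ].
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
    return 0.5 * total


# Identical distributions have zero divergence.
kl_same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])

# Unit-variance Gaussians whose means differ by 1 give 0.5 * 1^2.
kl_shift = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])
```

Because the divergence is an explicit function of the means and variances, no sampling is needed for this term, and its gradient with respect to those parameters is also explicit.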

“…More precisely, in the proposed framework, the parameters of $r_s(s \mid m)$ are initialized using its respective set of latent codes z, which themselves are initialized by the corresponding encoders (see Section 3), as opposed to [7] where a weighted combination of the latent codes (coming from different models) is used for initializing the parameters of $r_s(s)$. This might not be effective given that latent initialization is important in VAE-based AVSE [8]. Finally, the proposed posterior approximation $r_z(z_t \mid m_t) = \mathcal{N}(c_{tm}, \Omega_{tm})$ makes sampling, needed by (6), more efficient than the method of [7] which relies on the computationally demanding Metropolis-Hastings algorithm [15].…”

Section: Novelty of SwVAE w.r.t. [7] (mentioning)

confidence: 99%
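To see why a Gaussian posterior approximation makes sampling cheap, note that each sample is a single affine transform of standard normal noise, with no accept/reject loop as in Metropolis-Hastings. The sketch below is an illustration only: it assumes a diagonal covariance, uses D = 20 samples as mentioned in the Methods statement, and the names `sample_diag_gaussian`, `mc_expectation`, `c_tm` and `omega_tm_diag` are hypothetical, as are the numeric values.

```python
import random


def sample_diag_gaussian(mean, var, n_samples=20, seed=0):
    """Draw n_samples vectors from N(mean, diag(var)) by direct sampling."""
    rng = random.Random(seed)
    return [
        [mu + (v ** 0.5) * rng.gauss(0.0, 1.0) for mu, v in zip(mean, var)]
        for _ in range(n_samples)
    ]


def mc_expectation(fn, samples):
    """Monte Carlo estimate of E[fn(z)] from a list of samples."""
    return sum(fn(s) for s in samples) / len(samples)


# Hypothetical posterior parameters for one time frame (illustrative values).
c_tm = [0.5, -1.0, 2.0]
omega_tm_diag = [0.1, 0.2, 0.05]

samples = sample_diag_gaussian(c_tm, omega_tm_diag, n_samples=20)

# Example: estimate the expected squared norm of z under the approximation.
est_sq_norm = mc_expectation(lambda z: sum(x * x for x in z), samples)
```

Each draw here is independent and costs one pass over the latent dimensions, whereas a Metropolis-Hastings chain produces correlated samples and typically needs a burn-in period before its samples are usable.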

“…Recently, some unsupervised AVSE methods have been proposed that do not need noise signals for training [6][7][8], meaning that their training is agnostic to the noise type. This approach builds upon the audio-only speech enhancement counterpart [9,10] consisting of two main steps.…”

Section: Introduction (mentioning)

confidence: 99%


Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement

Sadeghi, Alameda-Pineda. 2021. Preprint.

Self Cite


Recently, audio-visual speech enhancement has been tackled in an unsupervised setting based on variational autoencoders (VAEs), where during training only clean data is used to train a generative model for speech, which at test time is combined with a noise model, e.g. nonnegative matrix factorization (NMF), whose parameters are learned without supervision. Consequently, the resulting model is agnostic to the noise type. When visual data are clean, audio-visual VAE-based architectures usually outperform their audio-only counterparts. The opposite happens when the visual data are corrupted by clutter, e.g. the speaker not facing the camera. In this paper, we propose to find the optimal combination of these two architectures through time. More precisely, we introduce a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time in an unsupervised manner, leading to the switching variational auto-encoder (SwVAE). We propose a variational factorization to approximate the computationally intractable posterior distribution. We also derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal. Our experiments demonstrate the promising performance of SwVAE.


An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Michelsanti, Tan, Zhang, et al. 2021. IEEE/ACM Trans. Audio Speech Lang. Process.

174


Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. More recently, visual information from the target speakers, such as lip movements and facial expressions, has been introduced to speech enhancement and speech separation systems, because the visual aspect of speech is essentially unaffected by the acoustic environment. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving state-of-the-art performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: visual features; acoustic features; deep learning methods; fusion techniques; training targets and objective functions. We also survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation.

