2019
DOI: 10.48550/arxiv.1912.10647
Preprint

Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

Abstract: In this paper, we are interested in unsupervised speech enhancement using latent variable generative models. We propose to learn a generative model for clean speech spectrogram based on a variational autoencoder (VAE) where a mixture of audio and visual networks is used to infer the posterior of the latent variables. This is motivated by the fact that visual data, i.e., lips images of the speaker, provide helpful and complementary information about speech. As such, they can help train a richer inference networ…

Cited by 4 publications
(12 citation statements)
References 13 publications
“…Listening tests for speech intelligibility assessment
DRT [255] | 1983 | Audio-only listening test using rhyming words
HINT [191] | 1994 | Audio-only listening test using everyday sentences
Matrix-like audio-visual test [178] | 2019 | Matrix test using audio-visual stimuli | [178], [13]
Estimators of speech quality based on perceptual models
PESQ [117], [119], [120], [214] | 2001 | Designed to assess quality across a wide range of codecs and network conditions mostly for telephony | [3], [5]-[7], [12], [17], [37], [55], [65], [66], [76], [77], [85], [99], [107], [108], [109], [122], [128], [136], [153], [154], [176], [178], [179], [183], [220]-[222], [239], [244], [263], [274], [279]
CSIG / CBAK / COVRL [104] | 2007 | Composite measures which combine basic objective measures | [108]
HASQI [131], [133] | 2010 | Specifically designed for hearing-impaired listeners | [99], [100]
POLQA [1...…”
Section: Type (mentioning)
confidence: 99%
“…Both of the algorithms were run for 200 iterations, on the same test set. For optimizing (8), the Adam optimizer [18] was used with a learning rate of 0.05 for 10 iterations. Moreover, we used D = 20 samples to compute (6) and (10).…”
Section: Methods (mentioning)
confidence: 99%
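The optimizer settings quoted above (Adam, learning rate 0.05, 10 iterations) can be illustrated with a minimal sketch. The objective in (8) is not reproduced in this excerpt, so the code below swaps in a hypothetical quadratic, `toy_grad`, purely for illustration; the helper name `adam_minimize` and the default moment coefficients (0.9, 0.999) are likewise assumptions and are not taken from the citing paper.

```python
import math


def adam_minimize(grad_fn, x0, lr=0.05, n_iters=10,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run a fixed number of Adam steps on a list of floats.

    grad_fn is a hypothetical callable returning the gradient at x.
    lr and n_iters default to the values quoted in the statement above;
    beta1, beta2 and eps are the usual Adam defaults (an assumption here).
    """
    x = list(x0)
    m = [0.0] * len(x)  # first-moment (mean) estimate
    v = [0.0] * len(x)  # second-moment (uncentered variance) estimate
    for t in range(1, n_iters + 1):
        g = grad_fn(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


def toy_grad(x):
    """Gradient of the stand-in objective sum((x_i - 1)^2); NOT equation (8)."""
    return [2.0 * (xi - 1.0) for xi in x]


x_final = adam_minimize(toy_grad, [0.0, 3.0])
```

With so few iterations and a step size of 0.05, each coordinate moves only a short distance toward the minimizer, which is consistent with using the inner optimization as a brief refinement inside a longer outer loop.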
“…where KL denotes the Kullback-Leibler divergence. In (8), the expectation over r_m and r_s can be evaluated in closed form. This is also the case for the KL term, as both distributions are Gaussian.…”
Section: E-z Step (mentioning)
confidence: 99%
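The closed-form KL term mentioned above can be sketched as follows. The excerpt states only that both distributions are Gaussian; the diagonal-covariance form used here is an assumption (it is the common choice in VAEs), and the function name `kl_diag_gaussians` is hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    Per dimension:
        0.5 * (log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1)
    All arguments are equal-length sequences of floats; variances must be > 0.
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total


# Identical distributions give zero divergence.
kl_same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])

# Unit-variance Gaussians whose means differ by 1 in a single dimension.
kl_shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```

Because the expression needs no sampling, it contributes no Monte Carlo variance to the objective, unlike terms that must be estimated from a finite number of draws.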