2023
DOI: 10.1109/taslp.2023.3285241

Speech Enhancement and Dereverberation With Diffusion-Based Generative Models

Abstract: This paper introduces an audio-visual speech enhancement system that leverages score-based generative models, also known as diffusion models, conditioned on visual information. In particular, we exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading. The layer-wise features of its transformer-based encoder are aggregated, time-aligned, and incorporated into the noise conditional score network. Experimental evaluations show that the proposed audiovi…

Cited by 53 publications (9 citation statements)
References 70 publications
“…Diffusion models are a class of generative models that have gained interest during recent years for a wide range of modalities, such as images [7,43,44], audio [45,46,16], video [47], and symbolic music [37], among others. These models generate new data instances by reversing the diffusion process, by which data $x_0 \sim p_{\text{data}}$ is progressively diffused into Gaussian noise $x_T \sim \mathcal{N}(0, \sigma_{\max}^2 I)$ over time $\tau$ [7].…”
Section: A Diffusion Model For Audio Inpainting
Citation type: mentioning
confidence: 99%
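
The forward diffusion described in the statement above can be illustrated with a minimal sketch. It assumes a variance-exploding perturbation x_tau = x_0 + sigma(tau) * z with a geometric noise schedule. The function names `sigma` and `diffuse`, the schedule, and the values of `sigma_min` and `sigma_max` are hypothetical choices for illustration and are not taken from the cited or citing papers.

```python
import numpy as np

def sigma(tau, sigma_min=0.01, sigma_max=50.0):
    """Hypothetical geometric noise schedule for tau in [0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** tau

def diffuse(x0, tau, rng):
    """Sample x_tau = x0 + sigma(tau) * z, with z ~ N(0, I)."""
    z = rng.standard_normal(x0.shape)
    return x0 + sigma(tau) * z

rng = np.random.default_rng(0)
# Toy stand-in for a clean signal (assumption: any real-valued array works here).
x0 = np.sin(np.linspace(0.0, 20.0, 16000))
for tau in (0.0, 0.5, 1.0):
    x_tau = diffuse(x0, tau, rng)
    print(f"tau={tau:.1f}  sigma={sigma(tau):.3f}  std(x_tau)={x_tau.std():.3f}")
```

As tau approaches 1, the noise term dominates the signal, so x_tau is approximately distributed as N(0, sigma_max^2 I). A generative model of this kind would learn to reverse this process; that reverse step is not shown here.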
“…These networks are termed as deep generative networks, such as variational autoencoder combined with non-negative matrix factorization method, generative adversarial networks, PixelCNN architecture, diffusion based generative networks for enhancement of speech [5,15-17]. Discriminative networks that are convolutional in nature are proposed for speech enhancement. They map the noisy speech to a clean speech target.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…In this process, unpleasant signal distortions are introduced in the enhanced speech signal [5]. Hence, a discriminative autoencoder based approach is designed for the denoising of speech signals. The first designed approach, Discriminative Denoising Autoencoder (DDAE), is the combination of a discriminator with an autoencoder and the second approach, discriminative UNET (DUNET), combines a discriminator with UNET based architecture.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…In the past, objective measures that are intrusive in nature, i.e., those that require a reference (e.g., [1,2]), have proven especially useful in practice. In recent years, however, audio algorithms based on generative approaches have been gaining popularity (e.g., [3-5]). These algorithms aim to predict a plausible audio instance, rather than a specific reference.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Current speech algorithm research is moving towards not previously achievable benefit heights, for example, through generative reconstruction of missing speech content (e.g., [3-5]). We may soon see single-channel speech enhancement algorithms that markedly improve intelligibility in challenging real-world environments.…”
Section: Introduction
Citation type: mentioning
confidence: 99%