ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746169
Speech Denoising in the Waveform Domain With Self-Attention

Abstract: In this work, we present CleanUNet 2, a speech denoising model that combines the advantages of waveform denoiser and spectrogram denoiser and achieves the best of both worlds. CleanUNet 2 uses a two-stage framework inspired by popular speech synthesis methods that consist of a waveform model and a spectrogram model. Specifically, CleanUNet 2 builds upon CleanUNet, the state-of-the-art waveform denoiser, and further boosts its performance by taking predicted spectrograms from a spectrogram denoiser as the input…
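The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: `stft_mag` is a toy framed-FFT spectrogram, and both denoiser functions are hypothetical stand-ins for the learned spectrogram and waveform models; only the data flow (spectrogram prediction feeding the waveform model as conditioning) mirrors the description.

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=128):
    """Toy magnitude spectrogram via framed FFT (stand-in for a real STFT)."""
    frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def spectrogram_denoiser(spec):
    """Hypothetical stand-in for the learned spectrogram denoiser.
    Here: crude spectral floor subtraction, purely for illustration."""
    return np.maximum(spec - spec.mean(), 0.0)

def waveform_denoiser(wav, cond_spec):
    """Hypothetical stand-in for the CleanUNet-style waveform denoiser,
    which would consume the predicted spectrogram as conditioning input."""
    return wav  # identity placeholder

def two_stage_denoise(noisy_wav):
    # Stage 1: predict a clean spectrogram from the noisy input.
    cond = spectrogram_denoiser(stft_mag(noisy_wav))
    # Stage 2: denoise the waveform, conditioned on the predicted spectrogram.
    return waveform_denoiser(noisy_wav, cond)
```

The design point is that the waveform model sees both the raw noisy signal and a spectral estimate, rather than either representation alone.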

Cited by 54 publications (33 citation statements) · References 40 publications
“…The ablation results on ASR performance also illustrate the efficacy of the TA and FA modules. In Table VIII, we compare the computation required by the models (ResTCN, ResTCN+TFA, MHANet, and MHANet+TFA), in terms of real-time factor (RTF) [61], which is the ratio of the time taken to process a speech utterance to the duration of the utterance. The RTFs are measured on an NVIDIA Tesla V100 GPU, averaged over 10 executions.…”
Section: Experiments On Asr Performancementioning
confidence: 99%
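The real-time factor (RTF) defined in the citation above, processing time divided by utterance duration, averaged over repeated executions, can be computed with a short helper. This is a generic sketch: `denoise_fn` is a placeholder for whatever model is being timed, and the averaging over `n_runs` follows the cited measurement setup.

```python
import time

def real_time_factor(denoise_fn, waveform, sample_rate, n_runs=10):
    """RTF = time to process an utterance / duration of the utterance.

    Averaged over n_runs executions; RTF < 1 means faster than real time.
    """
    duration = len(waveform) / sample_rate  # utterance length in seconds
    start = time.perf_counter()
    for _ in range(n_runs):
        denoise_fn(waveform)
    elapsed = (time.perf_counter() - start) / n_runs
    return elapsed / duration

# Example with a trivial pass-through "denoiser" on 1 s of 16 kHz audio:
rtf = real_time_factor(lambda w: w, [0.0] * 16000, 16000)
```

On a GPU one would also synchronize the device before reading the clock, otherwise asynchronous kernel launches make the measured time meaningless.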
“…supported, most speech technology tools are developed in Python ML frameworks, in particular PyTorch [9][10][11][12][13]. As Matlab does not support direct import of PyTorch models, this limits the extensibility of the current DIVA model.…”
Section: Plos Onementioning
confidence: 99%
“…These trends have resulted in many sophisticated opensource tools for processing speech and speech audio. Some examples of these tools are pyAu-dioAnalysis [15], PyTorch-Kaldi [9], SpeechBrain [10], ASVtorch [11], WaveNet [16], and Diff-Wave [13]. The current DIVA implementation in Simulink does not integrate directly with these tools and deep learning frameworks.…”
Section: Plos Onementioning
confidence: 99%