ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746169
Speech Denoising in the Waveform Domain With Self-Attention

Abstract: In this work, we present CleanUNet 2, a speech denoising model that combines the advantages of waveform denoiser and spectrogram denoiser and achieves the best of both worlds. CleanUNet 2 uses a two-stage framework inspired by popular speech synthesis methods that consist of a waveform model and a spectrogram model. Specifically, CleanUNet 2 builds upon CleanUNet, the state-of-the-art waveform denoiser, and further boosts its performance by taking predicted spectrograms from a spectrogram denoiser as the input…
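The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: `stft_mag` is a toy framed-FFT spectrogram, and both denoiser functions are hypothetical stand-ins for the learned spectrogram and waveform models; only the data flow (spectrogram prediction feeding the waveform model as conditioning) mirrors the description.

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=128):
    """Toy magnitude spectrogram via framed FFT (stand-in for a real STFT)."""
    frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def spectrogram_denoiser(spec):
    """Hypothetical stand-in for the learned spectrogram denoiser.
    Here: crude spectral floor subtraction, purely for illustration."""
    return np.maximum(spec - spec.mean(), 0.0)

def waveform_denoiser(wav, cond_spec):
    """Hypothetical stand-in for the CleanUNet-style waveform denoiser,
    which would consume the predicted spectrogram as conditioning input."""
    return wav  # identity placeholder

def two_stage_denoise(noisy_wav):
    # Stage 1: predict a clean spectrogram from the noisy input.
    cond = spectrogram_denoiser(stft_mag(noisy_wav))
    # Stage 2: denoise the waveform, conditioned on the predicted spectrogram.
    return waveform_denoiser(noisy_wav, cond)
```

The design point is that the waveform model sees both the raw noisy signal and a spectral estimate, rather than either representation alone.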

Cited by 54 publications (33 citation statements) · References 40 publications
“…The ablation results on ASR performance also illustrate the efficacy of the TA and FA modules. In Table VIII, we compare the computation required by the models (ResTCN, ResTCN+TFA, MHANet, and MHANet+TFA), in terms of real-time factor (RTF) [61], which is the ratio of the time taken to process a speech utterance to the duration of the utterance. The RTFs are measured on an NVIDIA Tesla V100 GPU, averaged over 10 executions.…”
Section: Experiments On Asr Performancementioning
confidence: 99%
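The real-time factor (RTF) defined in the citation above, processing time divided by utterance duration, averaged over repeated executions, can be computed with a short helper. This is a generic sketch: `denoise_fn` is a placeholder for whatever model is being timed, and the averaging over `n_runs` follows the cited measurement setup.

```python
import time

def real_time_factor(denoise_fn, waveform, sample_rate, n_runs=10):
    """RTF = time to process an utterance / duration of the utterance.

    Averaged over n_runs executions; RTF < 1 means faster than real time.
    """
    duration = len(waveform) / sample_rate  # utterance length in seconds
    start = time.perf_counter()
    for _ in range(n_runs):
        denoise_fn(waveform)
    elapsed = (time.perf_counter() - start) / n_runs
    return elapsed / duration

# Example with a trivial pass-through "denoiser" on 1 s of 16 kHz audio:
rtf = real_time_factor(lambda w: w, [0.0] * 16000, 16000)
```

On a GPU one would also synchronize the device before reading the clock, otherwise asynchronous kernel launches make the measured time meaningless.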
“…supported, most speech technology tools are developed in Python ML frameworks, in particular PyTorch [9][10][11][12][13]. As Matlab does not support direct import of PyTorch models, this limits the extensibility of the current DIVA model.…”
Section: Plos Onementioning
confidence: 99%
“…These trends have resulted in many sophisticated opensource tools for processing speech and speech audio. Some examples of these tools are pyAu-dioAnalysis [15], PyTorch-Kaldi [9], SpeechBrain [10], ASVtorch [11], WaveNet [16], and Diff-Wave [13]. The current DIVA implementation in Simulink does not integrate directly with these tools and deep learning frameworks.…”
Section: Plos Onementioning
confidence: 99%