Speech Enhancement Using Bayesian Wavenet

Qian, Kaizhi; Zhang, Yang; Chang, Shiyu; Yang, Xuesong; Florêncio, Dinei; Hasegawa‐Johnson, Mark

doi:10.21437/interspeech.2017-1672

Cited by 84 publications

(73 citation statements)

References 18 publications


“…Since the conclusion of our experiments other more advanced neural-network based denoising techniques have been proposed, such as generative adversarial networks [51], Wavenet-style based systems [52], [53] and convolutional neural networks [54], [55]. The latter showing improvements upon RNN based methods [55] in terms of PESQ and STOI scores.…”

Section: Our Work In Context (mentioning)

confidence: 88%

Speech Enhancement of Noisy and Reverberant Speech for Text-to-Speech

Valentini-Botinhao, Yamagishi. 2018. IEEE/ACM Trans. Audio Speech Lang. Process.


Abstract: Text-to-speech voices created from noisy and reverberant recordings are of lower quality. A simple way to improve this is to increase the quality of the recordings prior to text-to-speech training with speech enhancement methods such as noise suppression and dereverberation. In this paper we opted for this approach and to perform the enhancement we used a recurrent neural network. The network is trained with parallel data of clean and lower quality recordings of speech. The lower quality data was artificially created by adding recordings of environmental noise to studio quality recordings of speech and by convolving room impulse responses with these clean recordings. We trained separate networks with noise only, reverberation only and both reverberation and additive noise data. The quality of voices trained with lower quality data that has been enhanced using these networks was significantly higher in all cases. For the noise only case, the enhanced synthetic voice ranked as high as the voice trained with clean data. For the most realistic and challenging scenario, when both noise and reverberation were present, the improvements were more modest, but still significant.
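The data-creation step described in this abstract (convolving clean speech with a room impulse response, then adding environmental noise) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names `convolve`, `add_noise` and `degrade`, the signal-to-noise ratio parameter `snr_db`, and the choice to truncate the reverberant signal to the clean length are all assumptions made for the example.

```python
import math


def convolve(signal, impulse_response):
    """Full linear convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def add_noise(clean, noise, snr_db):
    """Scale `noise` to the requested SNR in dB (an assumed parameter), then add it."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_clean == 0 or p_noise == 0:
        raise ValueError("both signals need non-zero power")
    # Choose gain so that p_clean / (gain^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


def degrade(clean, noise, rir, snr_db):
    """Hypothetical helper: reverberate first, then add noise."""
    reverberant = convolve(clean, rir)[:len(clean)]  # keep the original length
    return add_noise(reverberant, noise, snr_db)
```

Pairing each `degrade(...)` output with its clean input gives the kind of parallel data the abstract describes; the noise-only and reverberation-only conditions correspond to calling `add_noise` or `convolve` alone.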


“…Wavenet [1] is an autoregressive convolutional neural network that produces raw audio waveforms by directly modeling the underlying probability distribution of audio samples. This has led to state-of-the-art performance in text-to-speech synthesis [2], [7], [17], [18], speech recognition [19], and other audio generation settings [1], [3], [4]. The Wavenet architecture aims to model the conditional probability among subsequent audio samples.…”

Section: B. Wavenet and Autoregressive CNNs (mentioning)

confidence: 99%

“…Autoregressive convolutional models achieve state-of-the-art results in audio [1]-[4] and language domains [5], [6] with respect to both estimating the data distribution and generating high-quality samples. Wavenet [1] is an example of autoregressive convolutional network, used for modelling audio for applications such as text-to-speech (TTS) synthesis and music generation.…”

Section: Introduction (mentioning)

confidence: 99%

FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA

Hussain, Javaheripi, Neekhara, et al. 2019. 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)


Autoregressive convolutional neural networks (CNNs) have been widely exploited for sequence generation tasks such as audio synthesis, language modeling and neural machine translation. WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolution that is used for sequence generation. While WaveNet produces state-of-the-art audio generation results, the naive inference implementation is quite slow; it takes a few minutes to generate just one second of audio on a high-end GPU. In this work, we develop the first accelerator platform FastWave for autoregressive convolutional neural networks, and address the associated design challenges. We design the Fast-Wavenet inference model in Vivado HLS and perform a wide range of optimizations including fixed-point implementation, array partitioning and pipelining. Our model uses a fully parameterized parallel architecture for fast matrix-vector multiplication that enables per-layer customized latency fine-tuning for further throughput improvement. Our experiments comparatively assess the tradeoff between throughput and resource utilization for various optimizations. Our best WaveNet design on the Xilinx XCVU13P FPGA that uses only on-chip memory, achieves 66× faster generation speed compared to CPU implementation and 11× faster generation speed than GPU implementation.
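The "stacked layers of dilated convolution" mentioned in this abstract can be illustrated with a small sketch. It covers only the causal dilated convolution itself and the resulting receptive field; the gating, residual and skip connections of WaveNet are deliberately left out, and the names `causal_dilated_conv` and `receptive_field` are hypothetical, chosen for this example rather than taken from any of the cited papers.

```python
def causal_dilated_conv(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before the start of `x` are treated as zero, so the output at
    time t depends only on inputs at times <= t (causality).
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample of the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Push a unit impulse through three layers with dilations 1, 2, 4.
signal = [1.0] + [0.0] * 9
for d in (1, 2, 4):
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)
```

With a kernel size of 2, doubling the dilation at each layer doubles the receptive field, which is why a modest number of layers can cover a long stretch of audio. During generation each new sample still depends on the previous outputs, which is the sequential cost the abstract refers to.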


“…That is, these methods exploit the sophistication of the generative network model to find a better approximation of the clean signal waveform. For example, [26] approximates the clean signal waveform using a Bayesian formalism that incorporates the structure of WaveNet. [11] uses a WaveNet structure to create a deterministic mapping from the noisy waveform to the clean waveform approximation.…”

Section: Generative Enhancement (mentioning)

confidence: 99%

Generative Speech Enhancement Based on Cloned Networks

Chinen, Kleijn, Lim, et al. 2019. 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)


We propose to implement speech enhancement by the regeneration of clean speech from a 'salient' representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the clones to be similar for these different input signals, we train a feature extractor network that is robust to noise. At inference, the salient features form the input to a WaveNet network that generates a natural and clean speech signal with the same attributes as the ground-truth clean signal. As the signal becomes noisier, our system produces natural sounding errors that stay on the speech manifold, in place of traditional artifacts found in other systems. Our experiments confirm that our generative enhancement system provides state-of-the-art enhancement performance within the generative class of enhancers according to a MUSHRA-like test. The clones based system matches or outperforms the other systems at each input signal-to-noise (SNR) range with statistical significance.
