2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
DOI: 10.1109/waspaa.2019.8937169

Speech Bandwidth Extension with Wavenet

Abstract: Large-scale mobile communication systems tend to contain legacy transmission channels with narrowband bottlenecks, resulting in characteristic 'telephone-quality' audio. While higher quality codecs exist, due to the scale and heterogeneity of the networks, transmitting higher sample rate audio with modern high-quality audio codecs can be difficult in practice. This paper proposes an approach where a communication node can instead extend the bandwidth of a band-limited incoming speech signal that may have been …

Cited by 24 publications (20 citation statements). References 13 publications (13 reference statements).
“…Kuleshov et al [22] used a convolutional encoder-decoder network inspired by image super-resolution. WaveNet [23] and its variants for BWE [24,25] use dilated convolutions to enable a large receptive field while preserving the original resolution. Feng et al [6] used FFTNet [26], which resembles the classical FFT process.…”
Section: Related Work
confidence: 99%
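The receptive-field growth from dilated convolutions mentioned in the statement above can be illustrated with a short calculation (a sketch for intuition, not code from any cited paper): for a stack of 1-D convolutions with kernel size k and per-layer dilations d_i, the receptive field is 1 + Σ (k−1)·d_i samples.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# WaveNet-style doubling dilations 1, 2, 4, ..., 512 with kernel size 2:
wavenet_like = receptive_field(2, [2 ** i for i in range(10)])   # 1024 samples
# The same 10 layers without dilation cover far less context:
undilated = receptive_field(2, [1] * 10)                         # 11 samples
```

With only 10 layers, exponentially growing dilations cover roughly two orders of magnitude more context than undilated layers, while the output keeps the input's time resolution.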
“…Learning such a long feature sequence poses a challenge to conventional sequence-modeling networks, including recurrent neural networks and 1-D convolutional networks. Dilated convolutional layers were proposed to alleviate this problem [13,14]. Despite the increased receptive field over the input signal, they do not capture utterance-level temporal information.…”
Section: RDPN Core Module
confidence: 99%
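The "preserving the original resolution" property these statements refer to can be shown with a minimal numpy sketch of a dilated 1-D convolution using 'same' zero padding (an illustrative toy implementation, not the networks from the cited papers):

```python
import numpy as np

def dilated_conv1d_same(x, w, dilation):
    """1-D dilated convolution with zero 'same' padding.

    The effective kernel spans (k - 1) * dilation + 1 samples, but the
    output length always equals the input length.
    """
    k = len(w)
    pad = (k - 1) * dilation // 2          # symmetric padding (k odd)
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * dilation] for j in range(k))
                     for i in range(len(x))])

x = np.random.randn(100)
y = dilated_conv1d_same(x, np.array([0.25, 0.5, 0.25]), dilation=8)
assert y.shape == x.shape   # resolution preserved despite a 17-sample span
```

Unlike strided downsampling, stacking such layers widens the context each output sample sees without ever reducing the sample rate, which is why WaveNet-style BWE models can emit output at the full target resolution.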
“…Prior studies [5][6][7][8][9][10] focus on estimating high-frequency magnitude and phase spectra in the frequency domain. To overcome the inherent difficulty of phase estimation, time-domain frameworks [11][12][13][14][15] have been proposed, which offer competitive voice quality.…”
Section: Introduction
confidence: 99%
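The phase-estimation difficulty named above can be seen in a small numpy experiment (illustrative only, not from any cited paper): a waveform is only recoverable from its spectrum when both magnitude and phase are known; magnitude alone yields a very different signal, which is why frequency-domain BWE must estimate phase while time-domain models sidestep it.

```python
import numpy as np

# Toy signal: two sinusoids sampled at 16 kHz.
sr = 16000
t = np.arange(1024) / sr
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

spec = np.fft.rfft(x)
mag, phase = np.abs(spec), np.angle(spec)

# Magnitude AND phase together reconstruct the waveform exactly ...
x_ok = np.fft.irfft(mag * np.exp(1j * phase), n=len(x))
assert np.allclose(x, x_ok)

# ... but magnitude with a wrong (here: zero) phase does not.
x_zero_phase = np.fft.irfft(mag, n=len(x))
assert not np.allclose(x, x_zero_phase)
```

A model that predicts only high-band magnitudes therefore still needs some phase (copied, estimated, or vocoded) before it can synthesize audio, whereas a wave-to-wave model outputs samples directly.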
“…Early methods were based on the source-filter model of speech production and exploited DNNs to estimate the upper frequency envelope [3]. Inspired by early successes in image super-resolution [4], end-to-end audio-based solutions were proposed, based on a wave-to-wave UNet [5], WaveNet [6,7], and hybrid time/frequency-domain models [8]. All these methods are trained by minimizing a reconstruction loss. … The design of SEANet is similar to [12], but we adopt the losses proposed in [13], in which the reconstruction loss is computed in the feature space of the discriminator, at different scales and at different layers.…”
Section: Introduction
confidence: 99%
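The "reconstruction loss in the feature space of the discriminator" described above is often called a feature-matching loss. A minimal numpy sketch of the idea follows; the two-layer random-weight "discriminator" here is a hypothetical stand-in (the actual SEANet discriminator is a learned multi-scale network), but the loss computation is the same shape: compare real and generated signals via L1 distances between intermediate features of every layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "discriminator": two 1-D conv + ReLU layers with fixed random weights.
# Shapes: (out_channels, in_channels, kernel).
weights = [rng.standard_normal((16, 1, 9)) * 0.1,
           rng.standard_normal((16, 16, 9)) * 0.1]

def conv_relu(x, w):
    """Valid 1-D convolution followed by ReLU. x: (channels, time)."""
    out_ch, in_ch, k = w.shape
    T = x.shape[1] - k + 1
    y = np.zeros((out_ch, T))
    for o in range(out_ch):
        for c in range(in_ch):
            for j in range(k):
                y[o] += w[o, c, j] * x[c, j:j + T]
    return np.maximum(y, 0.0)

def features(x):
    """Intermediate activations of every layer for a 1-D waveform x."""
    feats, h = [], x[None, :]          # add channel dimension
    for w in weights:
        h = conv_relu(h, w)
        feats.append(h)
    return feats

def feature_matching_loss(real, fake):
    """Mean L1 distance between discriminator features, summed over layers."""
    return sum(np.mean(np.abs(fr - ff))
               for fr, ff in zip(features(real), features(fake)))
```

Because the distance is taken on learned (here: random) features at several depths rather than raw samples, the loss is less sensitive to perceptually irrelevant waveform shifts than a sample-wise L1/L2 loss.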