Abstract: Traditional parametric coding of speech enables low bit rates but provides poor reconstruction quality because of the inadequacy of the underlying model. We describe how a WaveNet generative speech model can be used to generate high-quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high …
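The key mechanism in this family of decoders is local conditioning: the parameters decoded from the parametric bit stream (e.g., spectral envelope and pitch) are upsampled to the sample rate and injected into every gated, dilated causal convolution of the autoregressive model. The sketch below shows one generic WaveNet-style layer with such conditioning; it assumes PyTorch and illustrative channel sizes and is not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedWaveNetLayer(nn.Module):
    """One gated, dilated causal conv layer with local conditioning (generic sketch)."""
    def __init__(self, residual_ch=64, skip_ch=64, cond_ch=32, dilation=1):
        super().__init__()
        self.dilation = dilation
        self.filter_conv = nn.Conv1d(residual_ch, residual_ch, 2, dilation=dilation)
        self.gate_conv = nn.Conv1d(residual_ch, residual_ch, 2, dilation=dilation)
        self.cond_filter = nn.Conv1d(cond_ch, residual_ch, 1)   # conditioning projections
        self.cond_gate = nn.Conv1d(cond_ch, residual_ch, 1)
        self.res_out = nn.Conv1d(residual_ch, residual_ch, 1)
        self.skip_out = nn.Conv1d(residual_ch, skip_ch, 1)

    def forward(self, x, cond):
        # x: (B, residual_ch, T) embedded past samples; cond: (B, cond_ch, T) coder features
        pad = F.pad(x, (self.dilation, 0))                      # causal left padding
        f = torch.tanh(self.filter_conv(pad) + self.cond_filter(cond))
        g = torch.sigmoid(self.gate_conv(pad) + self.cond_gate(cond))
        h = f * g                                               # gated activation
        return x + self.res_out(h), self.skip_out(h)            # residual and skip paths

x = torch.randn(1, 64, 160)       # embedded past samples for one 10 ms frame at 16 kHz
cond = torch.randn(1, 32, 160)    # parametric coder features upsampled to the sample rate
res, skip = ConditionedWaveNetLayer(dilation=2)(x, cond)
```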
“…It is known that none of the well-established objective quality tools were designed to evaluate signals synthesized by non-deterministic generative models. In fact, it was shown in [29] that the enhanced quality achieved with a generative decoder was not predicted by the objective tool. We still conducted this evaluation to understand the performance with an objective quality predictor.…”
Audio codecs are typically transform-domain based and efficiently code stationary audio signals, but they struggle with speech and with signals containing dense transient events such as applause. Using these two classes of signals as examples, we demonstrate a technique for removing coding noise from decoded audio based on generative adversarial networks (GANs). A primary advantage of the proposed GAN-based coded-audio enhancer is that it operates end-to-end directly on decoded audio samples, eliminating the need for any manually crafted front end. Furthermore, the enhancement approach described in this paper can improve the sound quality of low-bitrate coded audio without any modification to existing standard-compliant encoders. Subjective tests show that the proposed enhancer significantly improves the quality of speech and of difficult-to-code applause excerpts.
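The following is a minimal sketch of the general training setup such an enhancer implies: a waveform-domain generator maps decoded (coded) audio toward the reference signal, while a discriminator is trained adversarially against it. The tiny networks and the particular loss choices (hinge adversarial loss plus an L1 fidelity term) are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(                      # maps decoded audio -> enhanced audio
    nn.Conv1d(1, 32, 15, padding=7), nn.PReLU(),
    nn.Conv1d(32, 32, 15, padding=7), nn.PReLU(),
    nn.Conv1d(32, 1, 15, padding=7),
)
discriminator = nn.Sequential(                  # scores waveforms: reference vs. enhanced
    nn.Conv1d(1, 32, 15, stride=4), nn.LeakyReLU(0.2),
    nn.Conv1d(32, 32, 15, stride=4), nn.LeakyReLU(0.2),
    nn.Conv1d(32, 1, 15),
)

def train_step(decoded, reference, g_opt, d_opt, lam=100.0):
    fake = generator(decoded)
    # discriminator update (hinge loss): push the reference up, the enhanced output down
    d_loss = (torch.relu(1 - discriminator(reference)).mean()
              + torch.relu(1 + discriminator(fake.detach())).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # generator update: adversarial term plus L1 fidelity to the reference signal
    g_loss = -discriminator(fake).mean() + lam * (fake - reference).abs().mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
decoded, reference = torch.randn(2, 1, 16000), torch.randn(2, 1, 16000)  # 1 s at 16 kHz
train_step(decoded, reference, g_opt, d_opt)
```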
“…LPC is useful in modern neural speech codecs, too. While generative autoregressive models such as WaveNet have greatly improved synthesized speech quality [12], this comes at the cost of model complexity during the decoding process [13]. For example, a vector-quantized variational autoencoder (VQ-VAE) with a WaveNet decoder achieves impressive speech quality at a very low bitrate of 1.6 kbps, yet with approximately 20 million trainable parameters [14].…”
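The division of labor this excerpt alludes to can be made concrete: an LPC analysis filter removes the spectral envelope, so a codec (neural or conventional) only has to spend bits on the much flatter residual. Below is a small, self-contained numerical illustration using the Levinson-Durbin recursion on a synthetic frame; the frame content, order, and sampling rate are arbitrary choices for the example, not values from the cited works.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_levinson(frame, order):
    """Levinson-Durbin: predictor a[1..order] with s[n] ~= sum_k a[k] * s[n-k]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a, err = np.zeros(order), r[0]
    for i in range(order):
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err   # reflection coefficient
        a_prev = a[:i].copy()
        a[i] = k
        a[:i] = a_prev - k * a_prev[::-1]                 # update lower-order coefficients
        err *= (1.0 - k * k)                              # remaining prediction error power
    return a, err

rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 200 * np.arange(512) / 16000) + 0.05 * rng.standard_normal(512)
a, _ = lpc_levinson(frame, order=10)
residual = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)   # prediction error signal
print("residual energy / frame energy:", np.sum(residual**2) / np.sum(frame**2))
```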
Scalability and efficiency are desirable in neural speech codecs, which should support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and that of the corresponding residuals. CQ does not simply shoehorn LPC into a neural network; it bridges the computational capacity of advanced neural network models and traditional, yet efficient and domain-specific, digital signal processing methods in an integrated manner. We demonstrate that CQ achieves much higher quality than its predecessor at 9 kbps with even lower model complexity. We also show that CQ can scale up to 24 kbps, where it outperforms AMR-WB and Opus. As a neural waveform codec, CQ uses fewer than 1 million parameters, significantly fewer than many other generative models.
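A toy sketch of the joint-codebook idea follows: one trainable codebook quantizes the LPC coefficients, another quantizes the residual code, and a single reconstruction loss trains both. Soft (softmax) assignment is used here so that the codebooks are differentiable; the shapes, sizes, and the stand-in decoder are illustrative assumptions, not CQ's actual configuration.

```python
import torch
import torch.nn as nn

class SoftCodebook(nn.Module):
    """Trainable codebook with soft assignment, so it receives gradients directly."""
    def __init__(self, num_codes, dim):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, z, alpha=10.0):             # z: (batch, dim)
        d = torch.cdist(z, self.codebook)         # distances to every codeword
        w = torch.softmax(-alpha * d, dim=1)      # soft assignment; hardens as alpha grows
        return w @ self.codebook                  # weighted mixture of codewords

lpc_vq = SoftCodebook(num_codes=256, dim=16)      # codebook for LPC coefficients
res_vq = SoftCodebook(num_codes=1024, dim=64)     # codebook for the residual code
decoder = nn.Linear(16 + 64, 80)                  # stand-in synthesis network

lpc_coeffs, res_code, target = torch.randn(8, 16), torch.randn(8, 64), torch.randn(8, 80)
recon = decoder(torch.cat([lpc_vq(lpc_coeffs), res_vq(res_code)], dim=1))
nn.functional.mse_loss(recon, target).backward()  # one loss updates both codebooks
```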
“…Many DNN methods [11][12] take inputs in time-frequency (T-F) domain from short time Fourier transform (STFT) or modified discrete cosine transform (MDCT), etc. Recent DNN-based codecs [13][14] [15] [16] model speech signals in time domain directly without T-F transformation. They are referred to as endto-end methods, yielding competitive performance comparing with current speech coding standards, such as AMR-WB [7].…”
Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier, with each module reconstructing the residual left by its preceding modules. CMRL differs from other DNN-based speech codecs in that, rather than modeling the speech compression problem in a single large neural network, it optimizes a series of less-complicated modules in a two-phase training scheme. The proposed method shows better objective performance than AMR-WB and a state-of-the-art DNN-based speech codec with a similar network architecture. As an end-to-end model, it takes raw PCM signals as input, but it is also compatible with linear predictive coding (LPC), showing better subjective quality at high bitrates than AMR-WB and Opus. The gain is achieved with only 0.9 million trainable parameters, a significantly less complex architecture than other DNN-based codecs in the literature.
Index Terms: speech coding, deep neural network, entropy coding, residual learning
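The residual cascade at the heart of CMRL can be summarized in a few lines: each module codes only what its predecessors failed to reconstruct, and the decoded contributions are summed. The sketch below uses a hypothetical stand-in autoencoding module (a real module would include quantization and entropy coding) and shows only the forward cascade, not the two-phase training mentioned in the abstract.

```python
import torch
import torch.nn as nn

class ResidualCodingModule(nn.Module):
    """Stand-in autoencoding module; a real CMRL module also quantizes its code."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, 9, padding=4), nn.Tanh(),
            nn.Conv1d(hidden, 1, 9, padding=4),
        )

    def forward(self, x):
        return self.net(x)

def cmrl_forward(modules, signal):
    residual, decoded_parts = signal, []
    for m in modules:
        decoded = m(residual)             # this module codes the current residual
        decoded_parts.append(decoded)
        residual = residual - decoded     # what the next module has to code
    return sum(decoded_parts), residual   # final reconstruction, leftover error

modules = nn.ModuleList([ResidualCodingModule() for _ in range(3)])
x = torch.randn(1, 1, 512)                # one frame of raw PCM samples
recon, leftover = cmrl_forward(modules, x)
```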
Model description
Before introducing CMRL as a module carrier, we describe the component module to be hosted by CMRL.
The component module
Recently, an end-to-end DNN speech codec (referred to as Kankanahalli-Net) has shown performance comparable to one of the standards (AMR-WB) [14]. We describe our component model, derived from Kankanahalli-Net, which consists of bottleneck residual learning [24], soft-to-hard quantization [25], and sub-pixel convolutional neural networks for upsampling [26]. Figure 1 depicts the component module.
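Of the three ingredients listed above, the sub-pixel upsampling step is easy to show compactly: a 1-D convolution produces r times as many channels, which are then interleaved along the time axis to upsample by a factor of r. The sketch below uses illustrative channel counts and kernel size, not Kankanahalli-Net's exact configuration.

```python
import torch
import torch.nn as nn

class SubPixelUpsample1d(nn.Module):
    """1-D sub-pixel (pixel-shuffle style) upsampling by factor r (generic sketch)."""
    def __init__(self, in_ch, out_ch, r, kernel=9):
        super().__init__()
        self.r = r
        self.conv = nn.Conv1d(in_ch, out_ch * r, kernel, padding=kernel // 2)

    def forward(self, x):                            # x: (B, in_ch, T)
        y = self.conv(x)                             # (B, out_ch*r, T)
        B, C, T = y.shape
        y = y.view(B, C // self.r, self.r, T)        # split channels into (out_ch, r)
        return y.permute(0, 1, 3, 2).reshape(B, C // self.r, T * self.r)

x = torch.randn(1, 64, 128)                          # code-domain feature map
print(SubPixelUpsample1d(64, 32, r=2)(x).shape)      # -> torch.Size([1, 32, 256])
```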