ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683277

Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder

Abstract: In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality. A prosody-transparent and speaker-independent model trained on the LibriSpeech corpus coding audio at 1.6 …
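The abstract describes the architecture only at a high level. Below is a minimal sketch of the vector-quantization bottleneck that gives a VQ-VAE its discrete code stream; the codebook size, latent dimension, and code rate are illustrative assumptions (chosen so that 8-bit codes at 200 Hz land on the 1.6 kb/s figure quoted by the citing papers further down this page), not the paper's exact configuration.

```python
# A minimal sketch of a VQ-VAE bottleneck, assuming PyTorch and toy sizes.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes=256, code_dim=64):
        super().__init__()
        # Learnable codebook: each row is one discrete latent "symbol".
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e):
        # z_e: encoder output, shape (batch, time, code_dim).
        # Squared distance from every latent vector to every codebook entry.
        dist = (z_e.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        indices = dist.argmin(dim=-1)          # discrete code per time step
        z_q = self.codebook(indices)           # quantized latents
        # Straight-through estimator: copy gradients around the argmin.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices


vq = VectorQuantizer()
z_e = torch.randn(1, 200, 64)    # one second of hypothetical 200 Hz latents
z_q, codes = vq(z_e)
print(z_q.shape, codes.shape)    # (1, 200, 64) (1, 200)
# 256 codes = 8 bits per step; 200 steps/s * 8 bits = 1.6 kb/s code stream,
# matching the 1.6 kbps the citing papers report for this codec.
```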

Cited by 79 publications (65 citation statements) | References 12 publications
“…Hence, to capture any temporal correlation, we require time-domain coupling in the autoencoding process, which can be achieved by either (1) applying a feed-forward network that can access data from multiple time-steps, e.g. using temporal convolution [8] or self-attention [12], or (2) using recurrent network architectures. In this work, we focus on the latter approach.…”
Section: Recurrent Autoencoder (mentioning)
confidence: 99%
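A minimal sketch of option (2) from the statement above, assuming PyTorch and illustrative layer sizes: a GRU on each side of a low-dimensional bottleneck gives the autoencoder the time-domain coupling the citing authors describe, since each per-frame code can depend on all earlier frames.

```python
import torch
import torch.nn as nn


class RecurrentAutoencoder(nn.Module):
    def __init__(self, frame_dim=128, hidden=256, bottleneck=16):
        super().__init__()
        self.enc_rnn = nn.GRU(frame_dim, hidden, batch_first=True)
        self.to_code = nn.Linear(hidden, bottleneck)   # low-dim code per frame
        self.dec_rnn = nn.GRU(bottleneck, hidden, batch_first=True)
        self.to_frame = nn.Linear(hidden, frame_dim)

    def forward(self, frames):
        # frames: (batch, time, frame_dim). The recurrence lets each code
        # depend on all previous frames, unlike a frame-by-frame autoencoder.
        h, _ = self.enc_rnn(frames)
        codes = self.to_code(h)
        h, _ = self.dec_rnn(codes)
        return self.to_frame(h), codes


model = RecurrentAutoencoder()
recon, codes = model(torch.randn(2, 50, 128))   # two 50-frame toy utterances
print(recon.shape, codes.shape)                 # (2, 50, 128) (2, 50, 16)
```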
“…It is worth noting that the video compression framework of DVC [13] can be viewed as an instantiation of Fig. 1(f) where decoded data in the previous time steps are fed back to the encoder for explicit motion and residual information compression, and the one proposed in VQ-VAE [7,8] can be viewed as a convolutional variant of Fig. 1(d) where both encoder and decoder use convolution to cover a large temporal receptive field without any decoder-to-encoder feedback.…”
Section: Recurrent Autoencoder (mentioning)
confidence: 99%
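The "convolutional variant covering a large temporal receptive field" can be made concrete with a stack of dilated causal 1-D convolutions, WaveNet-style. The sketch below uses assumed channel counts and depth, not the cited models' actual layers; note how the receptive field grows exponentially with depth, with no decoder-to-encoder feedback.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedCausalStack(nn.Module):
    def __init__(self, channels=64, layers=8):
        super().__init__()
        # Dilation doubles per layer: 1, 2, 4, ..., 128.
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(layers)
        )
        self.receptive_field = 2 ** layers  # samples of context per output

    def forward(self, x):
        # x: (batch, channels, time). Left-padding keeps each conv causal:
        # the output at time t only sees inputs at times <= t.
        for conv in self.convs:
            pad = conv.dilation[0] * (conv.kernel_size[0] - 1)
            x = conv(F.pad(x, (pad, 0)))
        return x


stack = DilatedCausalStack()
y = stack(torch.randn(1, 64, 1000))
print(stack.receptive_field, y.shape)   # 256 torch.Size([1, 64, 1000])
```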
“…While generative autoregressive models such as WaveNet have greatly improved synthesized speech quality [12], this comes at the cost of model complexity during the decoding process [13]. For example, a vector-quantized variational autoencoder (VQ-VAE) with a WaveNet decoder achieves impressive speech quality at a very low bitrate of 1.6 kbps, yet with approximately 20 million trainable parameters [14]. To make such a system more efficient, LPC can still offload computational overhead from the neural networks.…”
Section: Introduction (mentioning)
confidence: 99%
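What "offloading to LPC" means in practice: a cheap all-pole predictor captures the spectral envelope, leaving only the low-energy residual for a neural network to model. A minimal NumPy sketch, with the predictor order and the synthetic test signal as assumptions:

```python
import numpy as np


def lpc(x, order=16):
    """Levinson-Durbin recursion: autocorrelation in, all-pole coefficients out."""
    r = np.array([np.dot(x[: len(x) - i], x[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])) / err  # reflection coeff
        a[1 : i + 1] = a[1 : i + 1] + k * a[i - 1 :: -1]
        err *= 1.0 - k * k
    return a


# Synthetic resonant signal standing in for a speech frame (an assumption).
rng = np.random.default_rng(0)
x = rng.standard_normal(4000)
for n in range(2, len(x)):
    x[n] += 1.8 * x[n - 1] - 0.9 * x[n - 2]   # stable two-pole resonance

a = lpc(x, order=16)
e = np.convolve(a, x)[: len(x)]   # analysis filter A(z): the prediction residual
print(np.var(e) / np.var(x))      # far below 1: the envelope is predicted away
```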
“…Many DNN methods [11][12] take inputs in the time-frequency (T-F) domain, obtained from the short-time Fourier transform (STFT), the modified discrete cosine transform (MDCT), etc. Recent DNN-based codecs [13][14][15][16] model speech signals directly in the time domain, without a T-F transformation. These are referred to as end-to-end methods, and they yield competitive performance compared with current speech coding standards such as AMR-WB [7].…”
Section: Introduction (mentioning)
confidence: 99%
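The two input conventions contrasted here differ only in a front-end transform. A short sketch of both, with the sample rate, frame length, and hop size as illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                         # assumed sample rate
x = np.random.default_rng(0).standard_normal(fs)   # one second of stand-in audio

# Time-domain ("end-to-end") codecs consume the raw waveform directly.
print(x.shape)                                     # (16000,)

# T-F codecs consume a spectrogram instead: 512-sample frames, 50% overlap.
f, t, X = stft(x, fs=fs, nperseg=512, noverlap=256)
log_mag = np.log1p(np.abs(X))                      # e.g. log-magnitude features
print(log_mag.shape)                               # (257, ~64): freq bins x frames
```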
“…Many DNN-based codecs achieve both low bitrates and high perceptual quality, the two main targets for speech codecs [17][18][19], but at the cost of high model complexity. A WaveNet-based variational autoencoder (VAE) [16] outperforms other low-bitrate codecs in listening tests; however, with 20 million parameters, the model is too big for real-time processing on a resource-constrained device. Similarly, codecs built on SampleRNN [20][21] can be energy-intensive.…”
Section: Introduction (mentioning)
confidence: 99%
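The 20-million-parameter figure is easy to sanity-check by counting trainable parameters. The toy decoder below is a hypothetical stand-in, not the WaveNet of [16]:

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Hypothetical stand-in: 30 dilated 1-D conv layers, 256 channels each.
toy_decoder = nn.Sequential(
    *[nn.Conv1d(256, 256, kernel_size=2, dilation=2 ** (i % 10)) for i in range(30)]
)
print(f"{count_parameters(toy_decoder) / 1e6:.1f}M parameters")  # ~3.9M
# One such layer alone holds 256*256*2 weights + 256 biases (~131k), so a
# full WaveNet with wider residual/skip paths easily reaches tens of millions.
```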