2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
DOI: 10.1109/waspaa52581.2021.9632723
Harp-Net: Hyper-Autoencoded Reconstruction Propagation for Scalable Neural Audio Coding

Cited by 14 publications (3 citation statements)
References 9 publications
“…Other coding architectures use quantized features from different layers of an autoencoder network to code speech at different bitrates [96]. For example, residual networks (ResNet) use "short-cuts" to pass information from one layer directly to a successor layer, as a bypass, while another approach [138] cascades residuals across a series of DNN modules.…”
Section: Residual Network Coding
Confidence: 99%
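The cascaded-residual idea quoted above — each module coding whatever error its predecessor left behind, so that summing any prefix of stage outputs gives a valid, progressively refined reconstruction — can be sketched with a toy stand-in in which every "module" is just a uniform quantizer. The function names and step sizes here are illustrative, not taken from any cited paper.

```python
import numpy as np

def quantize(x, step):
    """Hypothetical stand-in for one coding module: uniform quantization."""
    return np.round(x / step) * step

def cascaded_residual_code(x, steps):
    """Each stage codes the residual left by the previous stage; the sum of
    any prefix of stage outputs is a valid (scalable) reconstruction."""
    stages, residual = [], x
    for step in steps:
        coded = quantize(residual, step)
        stages.append(coded)
        residual = residual - coded
    return stages

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 8)
stages = cascaded_residual_code(x, steps=[0.5, 0.1, 0.02])
coarse = stages[0]        # low-bitrate reconstruction
fine = sum(stages)        # adding later stages shrinks the error
```

With round-to-nearest quantization, the residual after a stage with step `s` lies in `[-s/2, s/2]`, so each extra stage tightens the reconstruction bound — the same scalability property the residual-coding architectures above exploit.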
“…In this work, we combine the merit from both FP- and BP-QAT and propose General Quantizer (GQ) that navigates weights to quantization centroids without introducing augmented regularizers but via feedforward-only operators. Our work is inspired by a continuous relaxation of quantization [25] also used for speech representation learning [26,27,28,29,30,31,32], and the µ-law algorithm for 8-bit pulse-code modulation (PCM) digital telecommunication [33].…”
Section: Related QAT Approaches
Confidence: 99%
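The µ-law algorithm referenced in [33] is the standard companding curve used for 8-bit PCM telephony (ITU-T G.711, µ = 255): it compresses the dynamic range logarithmically so that quiet samples get finer quantization resolution. A minimal sketch of the compress/expand pair:

```python
import math

MU = 255  # standard value for 8-bit mu-law PCM (ITU-T G.711)

def mu_law_compress(x, mu=MU):
    """Compand a sample in [-1, 1] to [-1, 1] along the mu-law curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Invert the companding, mapping [-1, 1] back to [-1, 1]."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# A quiet sample is expanded toward mid-scale before quantization:
y = mu_law_compress(0.01)          # ~0.23, much larger than 0.01
x_rec = mu_law_expand(y)           # round-trips back to ~0.01
```

In an 8-bit codec the companded value `y` would then be uniformly quantized; the expansion step after dequantization restores the original amplitude scale.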
“…This has motivated data-driven approaches to train neural networks to perform speech coding. These networks leverage large amounts of training data while relaxing the assumptions made on the type of transformations applied by the system [3][4][5][6][7][8][9][10]. In particular, the SoundStream neural codec combines a causal convolutional architecture with a residual vector quantizer.…”
Section: Introduction
Confidence: 99%
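A residual vector quantizer of the kind SoundStream uses can be sketched as a cascade of nearest-neighbor codebooks, each quantizing the residual left by the previous stage; the decoder sums the selected codewords. The codebooks below are random placeholders standing in for trained ones, and the function names are illustrative.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each codebook quantizes the residual of the previous one."""
    indices, residual = [], x.copy()
    for cb in codebooks:                           # cb has shape (num_codes, dim)
        i = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(i)
        residual = residual - cb[i]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of one codeword per stage."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(1)
codebooks = [rng.normal(size=(16, 4)),             # coarse stage
             0.1 * rng.normal(size=(16, 4))]       # finer refinement stage
x = rng.normal(size=4)
idx = rvq_encode(x, codebooks)                     # one index per stage
x_hat = rvq_decode(idx, codebooks)
```

Transmitting one index per stage is what makes the scheme bitrate-scalable: dropping trailing stages still yields a coarser but valid reconstruction, which mirrors the layered decoding idea in the paper this page tracks.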