Efficient and Scalable Neural Residual Waveform Coding with Collaborative Quantization

Zhen, Kai; Lee, Mi Suk; Sung, Jongwoo; Beack, Seungkwon; Kim, Minje

doi:10.1109/icassp40776.2020.9054347

Cited by 15 publications

(11 citation statements)

References 23 publications

“…Each frame contains T = 512 samples with an overlap of 32 samples, where a Hann window is applied to the overlapping region. Note that the choice of frame size is to align the system's hyperparameters to the previous work [21], [26], [31], but it does not necessarily mean that 512 results in an enough frequency resolution for PAM-based lost terms. For training, hyperparameters are found based on validation with another 104 clips: 128 frames for the batch size; α = 300 for the initial softmax scaling factor; 2 × 10⁻⁴ for the initial learning rate of the Adam optimizer [32], and 2 × 10 for the second cascaded modules; 64 and 32 kernels for the quantization for low and high bitrate cases, respectively; 50 and 30 for the number of epochs to train the first and the second modules in CMRL, respectively.…”

Section: Methods (mentioning)

confidence: 99%
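The framing described in the quoted statement (512-sample frames, 32-sample overlap, Hann window on the overlapping region only) can be pictured with a short sketch. This is one possible reading, not the authors' code: the function names are hypothetical, and the half-sample offset in the taper is an assumption chosen so that overlapping tapers sum to one.

```python
import math

FRAME = 512    # samples per frame (from the quoted statement)
OVERLAP = 32   # samples shared by neighbouring frames (from the quoted statement)
HOP = FRAME - OVERLAP

def tapered_window(frame=FRAME, overlap=OVERLAP):
    """Flat window whose two ends follow the halves of a Hann curve."""
    # Assumption: a half-sample offset makes rise[n] + rise[overlap-1-n] == 1.
    rise = [0.5 - 0.5 * math.cos(math.pi * (n + 0.5) / overlap)
            for n in range(overlap)]
    return rise + [1.0] * (frame - 2 * overlap) + rise[::-1]

def split_frames(signal, frame=FRAME, hop=HOP):
    """Cut the signal into windowed frames; trailing samples that do not
    fill a whole frame are dropped in this sketch."""
    win = tapered_window(frame, frame - hop)
    starts = range(0, len(signal) - frame + 1, hop)
    return [[signal[s + i] * win[i] for i in range(frame)] for s in starts]

def overlap_add(frames, hop=HOP):
    """Sum the frames back at their original offsets."""
    frame = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame)
    for k, fr in enumerate(frames):
        for i, v in enumerate(fr):
            out[k * hop + i] += v
    return out
```

Because the rising and falling tapers are complementary, overlap-adding unmodified frames reproduces the input everywhere except the first and last 32 samples, which only one frame covers.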

Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding

Zhen, Lee, Sung, et al. 2020

IEEE Signal Process. Lett.

Self Cite

Conventional audio coding technologies commonly leverage human perception of sound, or psychoacoustics, to reduce the bitrate while preserving the perceptual quality of the decoded audio signals. For neural audio codecs, however, the objective nature of the loss function usually leads to suboptimal sound quality as well as high run-time complexity due to the large model size. In this work, we present a psychoacoustic calibration scheme to re-define the loss functions of neural audio coding systems so that it can decode signals more perceptually similar to the reference, yet with a much lower model complexity. The proposed loss function incorporates the global masking threshold, allowing the reconstruction error that corresponds to inaudible artifacts. Experimental results show that the proposed model outperforms the baseline neural codec twice as large and consuming 23.4% more bits per second. With the proposed method, a lightweight neural codec, with only 0.9 million parameters, performs near-transparent audio coding comparable with the commercial MPEG-1 Audio Layer III codec at 112 kbps.
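To illustrate what "allowing the reconstruction error that corresponds to inaudible artifacts" could mean in practice, here is a minimal sketch that penalizes only the per-bin error power above a masking threshold. It is not the loss defined in the paper: the function name, the inputs, and the hinge form are all assumptions, and the masking threshold is taken as a given list of linear power values rather than computed by a psychoacoustic model.

```python
def masked_error_loss(reference_spec, decoded_spec, masking_threshold):
    """Sum of error power that exceeds the masking threshold, bin by bin.

    All three arguments are hypothetical per-frequency-bin lists of equal
    length: two magnitude spectra and a threshold in linear power units.
    """
    loss = 0.0
    for ref, dec, mask in zip(reference_spec, decoded_spec, masking_threshold):
        error_power = (ref - dec) ** 2
        # Error below the threshold is treated as inaudible and costs nothing.
        loss += max(0.0, error_power - mask)
    return loss
```

With a zero threshold this reduces to a plain sum of squared errors, so the threshold is the only thing that separates it from a purely objective loss.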

“…1) The compact NWC module and its performance: Compared to our previous models in [20] [46][47] that use 0.45 million parameters, the newly proposed NWC in this work only has 0.35 million parameters. It is also a significant reduction from the other compact neural speech codec [31] with 1.6 million parameters.…”

Section: Objective Measurements (mentioning)

confidence: 99%

Scalable and Efficient Neural Speech Coding: A Hybrid Design

Zhen, Sung, Lee, et al. 2022

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

This work presents a scalable and efficient neural waveform codec (NWC) for speech compression. We formulate the speech coding problem as an autoencoding task, where a convolutional neural network (CNN) performs encoding and decoding as its feedforward routine. The proposed CNN autoencoder also defines quantization and entropy coding as a trainable module, so the coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model architectures to our fully convolutional network model, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models are with a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we employ the residual coding concept to concatenate multiple NWC autoencoding modules, where an NWC module performs residual coding to restore any reconstruction loss that its preceding modules have created. CMRL can scale down to cover lower bitrates as well, for which it employs linear predictive coding (LPC) module as its first autoencoder. Once again, instead of a mere concatenation of LPC and NWC, we redefine LPC's quantization as a trainable module to enhance the bit allocation tradeoff between LPC and its following NWC modules. Compared to the other autoregressive decoder-based neural speech coders, our decoder has significantly smaller architecture, e.g., with only 0.12 million parameters, more than 100 times smaller than a WaveNet decoder. Compared to the LPCNet-based speech codec, which leverages the speech production model to reduce the network complexity in low bitrates, ours can scale up to higher bitrates to achieve transparent performance. Our lightweight neural speech coding model achieves comparable subjective scores against AMR-WB at the low bitrate range and provides transparent coding quality at 32 kbps.
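The residual-coding idea in the abstract above, where each module codes whatever reconstruction error its predecessors left behind, can be sketched with toy stand-ins. The uniform scalar quantizers below are an assumption used in place of the NWC autoencoder modules, and both function names are hypothetical.

```python
def make_uniform_codec(step):
    """Stand-in for one coding module: round each sample to a grid."""
    def codec(x):
        return [round(v / step) * step for v in x]
    return codec

def cascade_decode(signal, codecs):
    """Run the modules in order; each one codes the residual left so far."""
    recon = [0.0] * len(signal)
    for codec in codecs:
        residual = [s - r for s, r in zip(signal, recon)]
        coded = codec(residual)
        # The running reconstruction is the sum of all module outputs.
        recon = [r + c for r, c in zip(recon, coded)]
    return recon
```

Appending a finer module shrinks the remaining error, which is how a cascade of this kind can scale from lower to higher bitrates by adding modules.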

“…For example, fully-convolutional autoencoders have been successfully transformed into a codec, whose bottleneck layer is quantized to produce bitstrings out of waveforms [5]. These relatively compact waveform codecs start to compete with AMR-WB and Opus after being coupled with linear predictive coding (LPC) [6]. Meanwhile, generative models, such as WaveNet [7], have proven to be effective towards speech coding reducing bitrates down to 2.4 kbps, while retaining reasonable speech quality [8,9].…”

Section: Introduction (mentioning)

confidence: 99%

HARP-Net: Hyper-Autoencoded Reconstruction Propagation for Scalable Neural Audio Coding

Petermann, Beack, Kim 2021

Preprint

Self Cite

We propose a novel autoencoder architecture that improves the architectural scalability of general-purpose neural audio coding models. An autoencoder-based codec employs quantization to turn its bottleneck layer activation into bitstrings, a process that hinders information flow between the encoder and decoder parts. To circumvent this issue, we employ additional skip connections between the corresponding pair of encoder-decoder layers. The assumption is that, in a mirrored autoencoder topology, a decoder layer reconstructs the intermediate feature representation of its corresponding encoder layer. Hence, any additional information directly propagated from the corresponding encoder layer helps the reconstruction. We implement this kind of skip connections in the form of additional autoencoders, each of which is a small codec that compresses the massive data transfer between the paired encoder-decoder layers. We empirically verify that the proposed hyper-autoencoded architecture improves perceptual audio quality compared to an ordinary autoencoder baseline.
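The abstract above does not say how the bottleneck is quantized. As an illustration only, the sketch below uses a softmax-based soft assignment over a small set of quantization kernels, in the spirit of the "softmax scaling factor" and "kernels" mentioned in an earlier citation statement on this page. The function names, the absolute-distance measure, and the kernel values are assumptions, not details taken from HARP-Net.

```python
import math

def soft_assign(value, kernels, alpha):
    """Softmax over negative scaled distances to each kernel."""
    logits = [-alpha * abs(value - k) for k in kernels]
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_quantize(value, kernels, alpha):
    """Differentiable surrogate: probability-weighted mix of the kernels."""
    probs = soft_assign(value, kernels, alpha)
    return sum(p * k for p, k in zip(probs, kernels))

def hard_quantize(value, kernels):
    """Test-time behaviour: pick the nearest kernel and return its index."""
    idx = min(range(len(kernels)), key=lambda i: abs(value - kernels[i]))
    return idx, kernels[idx]
```

A small scaling factor gives a smooth blend of kernels that gradients can pass through, while a large one makes the soft output approach the nearest kernel, whose index is what would be written to the bitstring.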
