ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9054347

Efficient and Scalable Neural Residual Waveform Coding with Collaborative Quantization

Abstract: Scalability and efficiency are desired in neural speech codecs, which should support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC into a neural network, but bridges the computational capacity of advanced neural network models and traditional, yet efficient and domain-specific digital signal processing methods in an integrated manner. We …
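
To make "LPC coefficients and the corresponding residuals" concrete, here is a minimal sketch of classical LPC analysis and synthesis in plain Python. It is not the paper's CQ scheme: nothing is quantized or learned here, and the toy signal, the predictor order of 2, and all function names are assumptions made for illustration. It only shows that a short predictor leaves a low-energy residual, and that the residual plus the coefficients rebuild the signal.

```python
import random

def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n-k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return predictor coefficients a[1..order] so that x[n] ~ sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]

def lpc_residual(x, coeffs):
    """Analysis filter: e[n] = x[n] - sum_k a[k] * x[n-k]."""
    return [x[n] - sum(c * x[n - 1 - k] for k, c in enumerate(coeffs) if n - 1 - k >= 0)
            for n in range(len(x))]

def lpc_synthesis(e, coeffs):
    """Synthesis filter: rebuild x[n] from the residual and past outputs."""
    y = []
    for n in range(len(e)):
        pred = sum(c * y[n - 1 - k] for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        y.append(e[n] + pred)
    return y

def energy(v):
    return sum(s * s for s in v)

# Toy "speech-like" signal (an assumption): a stable 2nd-order resonance driven by noise.
rng = random.Random(0)
signal = [0.0, 0.0]
for _ in range(2000):
    signal.append(1.3 * signal[-1] - 0.6 * signal[-2] + rng.gauss(0.0, 1.0))

coeffs = levinson_durbin(autocorrelation(signal, 2), order=2)
residual = lpc_residual(signal, coeffs)
rebuilt = lpc_synthesis(residual, coeffs)

print("LPC coefficients:", coeffs)
print("residual / signal energy:", energy(residual) / energy(signal))
```

In a codec, both the coefficients and the residual would have to be quantized before transmission; the abstract's point is that CQ learns those two quantizers jointly rather than separately.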

Cited by 15 publications (11 citation statements)
References: 23 publications
“…Each frame contains T = 512 samples with an overlap of 32 samples, where a Hann window is applied to the overlapping region. Note that the choice of frame size is to align the system's hyperparameters to the previous work [21], [26], [31], but it does not necessarily mean that 512 results in sufficient frequency resolution for PAM-based loss terms. For training, hyperparameters are found based on validation with another 104 clips: 128 frames for the batch size; α = 300 for the initial softmax scaling factor; 2 × 10⁻⁴ for the initial learning rate of the Adam optimizer [32], and 2 × 10 for the second cascaded modules; 64 and 32 kernels for the quantization for low and high bitrate cases, respectively; 50 and 30 for the number of epochs to train the first and the second modules in CMRL, respectively.…”
Section: Methods (mentioning)
confidence: 99%
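
The framing described in the statement above can be sketched as follows. This is one plausible reading, not the citing paper's code: it assumes the window is flat in the middle and that a 64-point periodic Hann window is split into a 32-sample rise and a 32-sample fall at the frame edges, so that overlapping tapers sum to one. The function names and the sine test signal are hypothetical.

```python
import math

FRAME = 512            # T, frame length in samples (from the quoted text)
OVERLAP = 32           # samples shared by adjacent frames (from the quoted text)
HOP = FRAME - OVERLAP  # 480 new samples per frame

def tapered_window(frame=FRAME, overlap=OVERLAP):
    """Flat window whose first and last `overlap` samples follow a Hann taper."""
    # Periodic Hann of length 2 * overlap, split into a rising and a falling half.
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / (2 * overlap))
            for n in range(2 * overlap)]
    return hann[:overlap] + [1.0] * (frame - 2 * overlap) + hann[overlap:]

def split_frames(x, window, hop=HOP):
    """Cut x into windowed frames that start every `hop` samples."""
    frame = len(window)
    return [[x[start + n] * window[n] for n in range(frame)]
            for start in range(0, len(x) - frame + 1, hop)]

def overlap_add(frames, hop=HOP):
    """Sum the frames back at their original positions."""
    frame = len(frames[0])
    y = [0.0] * (hop * (len(frames) - 1) + frame)
    for i, f in enumerate(frames):
        for n, v in enumerate(f):
            y[i * hop + n] += v
    return y

# Hypothetical test signal long enough for exactly four frames.
x = [math.sin(0.01 * n) for n in range(HOP * 3 + FRAME)]
window = tapered_window()
frames = split_frames(x, window)
y = overlap_add(frames)

# Away from the very first and last taper, overlap-add restores the input.
interior_error = max(abs(a - b) for a, b in zip(x[OVERLAP:-OVERLAP], y[OVERLAP:-OVERLAP]))
print(len(frames), "frames, interior reconstruction error:", interior_error)
```

Because the rising and falling halves add up to one in every overlap region, a codec that processes each frame independently can cross-fade neighbouring frames without changing the signal level.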
“…1) The compact NWC module and its performance: Compared to our previous models in [20], [46], [47] that use 0.45 million parameters, the newly proposed NWC in this work only has 0.35 million parameters. It is also a significant reduction from the other compact neural speech codec [31] with 1.6 million parameters.…”
Section: Objective Measurements (mentioning)
confidence: 99%
“…For example, fully-convolutional autoencoders have been successfully transformed into a codec, whose bottleneck layer is quantized to produce bitstrings out of waveforms [5]. These relatively compact waveform codecs start to compete with AMR-WB and Opus after being coupled with linear predictive coding (LPC) [6]. Meanwhile, generative models, such as WaveNet [7], have proven to be effective for speech coding, reducing bitrates down to 2.4 kbps while retaining reasonable speech quality [8,9].…”
Section: Introduction (mentioning)
confidence: 99%
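
As an illustration of how a bottleneck value can be quantized while staying trainable, the sketch below uses a softmax over scaled distances to a small codebook. Whether [5] uses exactly this scheme is not stated in the excerpt; the connection to the "softmax scaling factor" α = 300 quoted in the Methods statement is an assumption, and the five-entry codebook and function names are hypothetical.

```python
import math

def soft_assignment(value, centroids, alpha):
    """Softmax over negative scaled distances from `value` to each centroid."""
    logits = [-alpha * abs(value - c) for c in centroids]
    top = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def quantize(value, centroids, alpha):
    """Return (soft value usable during training, hard codebook index to transmit)."""
    probs = soft_assignment(value, centroids, alpha)
    soft = sum(p * c for p, c in zip(probs, centroids))
    index = max(range(len(centroids)), key=lambda i: probs[i])
    return soft, index

# Hypothetical 5-entry scalar codebook and one bottleneck activation.
centroids = [-1.0, -0.5, 0.0, 0.5, 1.0]
value = 0.3

soft_low, index_low = quantize(value, centroids, alpha=1.0)      # blurry assignment
soft_high, index_high = quantize(value, centroids, alpha=300.0)  # nearly one-hot

print("alpha=1:  ", soft_low, index_low)
print("alpha=300:", soft_high, index_high)
```

With a small α the soft value is a blend of several centroids, which keeps gradients informative; with a large α it collapses onto the nearest centroid, so the soft value used in training matches the hard index that would be entropy coded into the bitstring.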