Interspeech 2019
DOI: 10.21437/interspeech.2019-1816

Cascaded Cross-Module Residual Learning Towards Lightweight End-to-End Speech Coding

Abstract: Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier with each module reconstructing the residual from its preceding modules. CMRL differs from other DNN-based speech codecs in that, rather than modeling the speech compression problem in a single…
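
The core CMRL idea, each module coding only the residual its predecessors could not reconstruct, can be sketched in a few lines. The sketch below is illustrative only: ToyModule, the 512-sample frame, and the encode/decode internals are placeholders, not the paper's actual modules.

    # Illustrative sketch of cross-module residual learning (CMRL).
    # ToyModule is a hypothetical stand-in for one trained autoencoder module;
    # the 512-sample frame and the decimate/repeat "coding" are placeholders.
    import numpy as np

    class ToyModule:
        def encode(self, frame):
            return frame[::2]                        # toy "code": keep every 2nd sample
        def decode(self, code):
            return np.repeat(code, 2)[:512]          # toy reconstruction back to 512 samples

    def cmrl_encode(frame, modules):
        codes, residual = [], frame.copy()
        for m in modules:                            # each module codes what is left over
            code = m.encode(residual)
            codes.append(code)
            residual = residual - m.decode(code)     # hand the residual to the next module
        return codes

    def cmrl_decode(codes, modules):
        # the decoded signal is the sum of all module reconstructions
        return sum(m.decode(c) for m, c in zip(modules, codes))

    frame = np.random.randn(512).astype(np.float32)
    modules = [ToyModule(), ToyModule()]
    restored = cmrl_decode(cmrl_encode(frame, modules), modules)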

Cited by 31 publications (23 citation statements)
References 25 publications
“…A 1D-CNN architecture on the time-domain samples serves as the desired lightweight autoencoder (AE) for end-to-end speech coding, where model complexity is a major concern [19,18]. As shown in Table 1, the encoder consists of four bottleneck ResNet stages [20], a downsampling convolutional layer that halves the feature map size in the middle, and a channel compression layer that creates a real-valued code vector of 256 dimensions.…”
Section: End-to-End Speech Coding Autoencoders
confidence: 99%
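
A rough sketch of an encoder with that shape is given below, assuming PyTorch; the channel counts, kernel sizes, and the 512-sample input frame are guesses for illustration and are not taken from the cited Table 1.

    import torch
    import torch.nn as nn

    class Bottleneck1d(nn.Module):
        """Bottleneck residual stage: 1x1 reduce, 9-tap conv, 1x1 expand."""
        def __init__(self, channels=100, hidden=20):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv1d(channels, hidden, kernel_size=1), nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=9, padding=4), nn.ReLU(),
                nn.Conv1d(hidden, channels, kernel_size=1),
            )
        def forward(self, x):
            return torch.relu(x + self.body(x))        # ResNet-style skip connection

    class Encoder(nn.Module):
        def __init__(self, channels=100):
            super().__init__()
            self.front = nn.Conv1d(1, channels, kernel_size=9, padding=4)
            self.stages = nn.Sequential(*[Bottleneck1d(channels) for _ in range(4)])
            # strided conv halves the time axis: 512 samples -> 256 steps
            self.down = nn.Conv1d(channels, channels, kernel_size=9, stride=2, padding=4)
            # channel compression to one channel -> a 256-dim real-valued code per frame
            self.compress = nn.Conv1d(channels, 1, kernel_size=9, padding=4)
        def forward(self, x):                           # x: (batch, 1, 512)
            h = self.stages(self.front(x))
            return self.compress(self.down(h))          # (batch, 1, 256)

    code = Encoder()(torch.randn(8, 1, 512))            # torch.Size([8, 1, 256])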
“…To compress speech signals, a core component of this AE is the trainable quantizer, which learns a discrete representation of the code layer in the AE. Out of the recent neural-network-compatible quantization schemes, such as VQ-VAE [21] and soft-to-hard quantization [22], we focus on soft-to-hard quantization, namely softmax quantization, as in the other end-to-end speech coding AEs [19,18]. Given an input frame x ∈ R^S of S samples, the output from the encoder is h = F_Enc(x), each element being a 16-bit floating-point value.…”
Section: Soft-to-Hard (Softmax) Quantization
confidence: 99%
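
A minimal sketch of such softmax quantization is shown below, under assumed settings (32 learnable scalar centroids and a softness parameter alpha); the paper's exact centroid count, distance scaling, and annealing schedule may differ.

    import torch

    def softmax_quantize(h, centroids, alpha=100.0, hard=False):
        """h: (batch, code_dim) real code; centroids: (num_centroids,) learnable scalars."""
        # squared distance from every code element to every centroid
        dist = (h.unsqueeze(-1) - centroids.view(1, 1, -1)) ** 2
        if hard:                                        # test time: snap to the nearest centroid
            return centroids[dist.argmin(dim=-1)]
        prob = torch.softmax(-alpha * dist, dim=-1)     # soft assignment keeps it differentiable
        return (prob * centroids).sum(dim=-1)           # probability-weighted centroid mixture

    centroids = torch.nn.Parameter(torch.linspace(-1.0, 1.0, 32))  # trained jointly with the AE
    h = torch.randn(8, 256)                             # encoder output (code layer)
    h_soft = softmax_quantize(h, centroids)             # used during backpropagation
    h_hard = softmax_quantize(h, centroids, hard=True)  # actual discrete code at test time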
“…Kankanahalli proposes a model that consists of fully convolutional layers to integrate dimension reduction, quantization, and entropy control tasks [6]. Cross-module residual learning (CMRL) inherits the convolutional pipeline and proposes a cascading structure, where multiple autoencoders are concatenated to work on the residual signal produced by the preceding ones [7]. In [8], CMRL is coupled with a trainable linear predictive coding (LPC) module as a preprocessor.…”
Section: Introduction
confidence: 99%
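
The LPC-preprocessor idea can be illustrated with a classical, non-trainable analysis/synthesis pair wrapped around a stand-in codec; the cited work [8] instead learns the LPC step jointly, so the sketch below (neural_codec is a hypothetical placeholder, and librosa/scipy are used only for convenience) shows only where LPC sits in the pipeline.

    import numpy as np
    import librosa
    import scipy.signal as sps

    def neural_codec(residual):
        return residual                                 # placeholder: pretend lossless coding

    def lpc_preprocessed_coding(frame, order=16):
        a = librosa.lpc(frame, order=order)             # LPC coefficients [1, a1, ..., a_order]
        excitation = sps.lfilter(a, [1.0], frame)       # analysis filter A(z): whitened residual
        decoded = neural_codec(excitation)              # the neural codec handles the residual
        return sps.lfilter([1.0], a, decoded)           # synthesis filter 1/A(z) restores speech

    frame = np.random.randn(512)                        # one speech frame (illustrative)
    restored = lpc_preprocessed_coding(frame)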