Interspeech 2019
DOI: 10.21437/interspeech.2019-1816

Cascaded Cross-Module Residual Learning Towards Lightweight End-to-End Speech Coding

Abstract: Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier with each module reconstructing the residual from its preceding modules. CMRL differs from other DNN-based speech codecs in that, rather than modeling the speech compression problem in a single…
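
The core CMRL idea, each module coding only the residual its predecessors could not reconstruct, can be sketched in a few lines. The sketch below is illustrative only: ToyModule, the 512-sample frame, and the encode/decode internals are placeholders, not the paper's actual modules.

    # Illustrative sketch of cross-module residual learning (CMRL).
    # ToyModule is a hypothetical stand-in for one trained autoencoder module;
    # the 512-sample frame and the decimate/repeat "coding" are placeholders.
    import numpy as np

    class ToyModule:
        def encode(self, frame):
            return frame[::2]                        # toy "code": keep every 2nd sample
        def decode(self, code):
            return np.repeat(code, 2)[:512]          # toy reconstruction back to 512 samples

    def cmrl_encode(frame, modules):
        codes, residual = [], frame.copy()
        for m in modules:                            # each module codes what is left over
            code = m.encode(residual)
            codes.append(code)
            residual = residual - m.decode(code)     # hand the residual to the next module
        return codes

    def cmrl_decode(codes, modules):
        # the decoded signal is the sum of all module reconstructions
        return sum(m.decode(c) for m, c in zip(modules, codes))

    frame = np.random.randn(512).astype(np.float32)
    modules = [ToyModule(), ToyModule()]
    restored = cmrl_decode(cmrl_encode(frame, modules), modules)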

Cited by 31 publications (23 citation statements)
References 25 publications
“…A 1D-CNN architecture on the time-domain samples serves as the desired lightweight autoencoder (AE) for end-to-end speech coding, where model complexity is a major concern [19,18]. As shown in Table 1, the encoder consists of four bottleneck ResNet stages [20], a downsampling convolutional layer that halves the feature map size in the middle, and a channel compression layer that creates a real-valued code vector of 256 dimensions.…”
Section: End-to-End Speech Coding Autoencoders
confidence: 99%
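
A rough sketch of an encoder with that shape is given below, assuming PyTorch; the channel counts, kernel sizes, and the 512-sample input frame are guesses for illustration and are not taken from the cited Table 1.

    import torch
    import torch.nn as nn

    class Bottleneck1d(nn.Module):
        """Bottleneck residual stage: 1x1 reduce, 9-tap conv, 1x1 expand."""
        def __init__(self, channels=100, hidden=20):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv1d(channels, hidden, kernel_size=1), nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=9, padding=4), nn.ReLU(),
                nn.Conv1d(hidden, channels, kernel_size=1),
            )
        def forward(self, x):
            return torch.relu(x + self.body(x))        # ResNet-style skip connection

    class Encoder(nn.Module):
        def __init__(self, channels=100):
            super().__init__()
            self.front = nn.Conv1d(1, channels, kernel_size=9, padding=4)
            self.stages = nn.Sequential(*[Bottleneck1d(channels) for _ in range(4)])
            # strided conv halves the time axis: 512 samples -> 256 steps
            self.down = nn.Conv1d(channels, channels, kernel_size=9, stride=2, padding=4)
            # channel compression to one channel -> a 256-dim real-valued code per frame
            self.compress = nn.Conv1d(channels, 1, kernel_size=9, padding=4)
        def forward(self, x):                           # x: (batch, 1, 512)
            h = self.stages(self.front(x))
            return self.compress(self.down(h))          # (batch, 1, 256)

    code = Encoder()(torch.randn(8, 1, 512))            # torch.Size([8, 1, 256])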
“…To compress speech signals, a core component of this AE is the trainable quantizer, which learns a discrete representation of the code layer in the AE. Out of the recent neural-network-compatible quantization schemes, such as VQ-VAE [21] and soft-to-hard quantization [22], we focus on soft-to-hard quantization, namely softmax quantization, as in the other end-to-end speech coding AEs [19,18]. Given an input frame x ∈ R^S of S samples, the output from the encoder is h = F_Enc(x), each element being a 16-bit floating-point value.…”
Section: Soft-to-Hard (Softmax) Quantization
confidence: 99%
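
A minimal sketch of such softmax quantization is shown below, under assumed settings (32 learnable scalar centroids and a softness parameter alpha); the paper's exact centroid count, distance scaling, and annealing schedule may differ.

    import torch

    def softmax_quantize(h, centroids, alpha=100.0, hard=False):
        """h: (batch, code_dim) real code; centroids: (num_centroids,) learnable scalars."""
        # squared distance from every code element to every centroid
        dist = (h.unsqueeze(-1) - centroids.view(1, 1, -1)) ** 2
        if hard:                                        # test time: snap to the nearest centroid
            return centroids[dist.argmin(dim=-1)]
        prob = torch.softmax(-alpha * dist, dim=-1)     # soft assignment keeps it differentiable
        return (prob * centroids).sum(dim=-1)           # probability-weighted centroid mixture

    centroids = torch.nn.Parameter(torch.linspace(-1.0, 1.0, 32))  # trained jointly with the AE
    h = torch.randn(8, 256)                             # encoder output (code layer)
    h_soft = softmax_quantize(h, centroids)             # used during backpropagation
    h_hard = softmax_quantize(h, centroids, hard=True)  # actual discrete code at test time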
“…Kankanahalli proposes a model that consists of fully convolutional layers to integrate dimension reduction, quantization, and entropy control tasks [6]. Cross-module residual learning (CMRL) inherits the convolutional pipeline and proposes a cascading structure, where multiple autoencoders are concatenated to work on the residual signal produced by the preceding ones [7]. In [8], CMRL is coupled with a trainable linear predictive coding (LPC) module as a preprocessor.…”
Section: Introduction
confidence: 99%
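
The LPC-preprocessor idea can be illustrated with a classical, non-trainable analysis/synthesis pair wrapped around a stand-in codec; the cited work [8] instead learns the LPC step jointly, so the sketch below (neural_codec is a hypothetical placeholder, and librosa/scipy are used only for convenience) shows only where LPC sits in the pipeline.

    import numpy as np
    import librosa
    import scipy.signal as sps

    def neural_codec(residual):
        return residual                                 # placeholder: pretend lossless coding

    def lpc_preprocessed_coding(frame, order=16):
        a = librosa.lpc(frame, order=order)             # LPC coefficients [1, a1, ..., a_order]
        excitation = sps.lfilter(a, [1.0], frame)       # analysis filter A(z): whitened residual
        decoded = neural_codec(excitation)              # the neural codec handles the residual
        return sps.lfilter([1.0], a, decoded)           # synthesis filter 1/A(z) restores speech

    frame = np.random.randn(512)                        # one speech frame (illustrative)
    restored = lpc_preprocessed_coding(frame)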