2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461487
End-To-End Optimized Speech Coding with Deep Neural Networks

Abstract: Modern compression algorithms are often the result of laborious domain-specific research; industry standards such as MP3, JPEG, and AMR-WB took years to develop and were largely hand-designed. We present a deep neural network model which optimizes all the steps of a wideband speech coding pipeline (compression, quantization, entropy coding, and decompression) end-to-end, directly from raw speech data, with no manual feature engineering necessary, and it trains in hours. In testing, our DNN-based coder performs on pa…

Cited by 55 publications (49 citation statements)
References 15 publications
“…A 1D-CNN architecture on the time-domain samples serves as the desired lightweight autoencoder (AE) for end-to-end speech coding, where model complexity is a major concern [19, 18]. As shown in Table 1, the encoder consists of four bottleneck ResNet stages [20], a downsampling convolutional layer that halves the feature-map size in the middle, and then a channel compression layer that creates a real-valued code vector of 256 dimensions.…”
Section: End-to-end Speech Coding Autoencoders
confidence: 99%
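The downsampling step described in this statement can be illustrated with a minimal numpy sketch. The stride-2 convolution below halves the feature-map length, which is how a 512-sample feature map would shrink to the 256 dimensions mentioned above; the 512-sample frame size and the averaging kernel are assumptions for illustration, not details from the cited paper, and a real encoder would use learned multi-channel kernels inside a DL framework.

```python
import numpy as np

def strided_conv1d(x, w, stride=2):
    # Zero-padded single-channel 1-D convolution with the given stride;
    # stride=2 halves the feature-map length, as a downsampling layer does.
    k = len(w)
    xp = np.pad(x, (k // 2, k // 2))
    out = [np.dot(xp[i:i + k], w) for i in range(0, len(x), stride)]
    return np.array(out)

frame = np.random.randn(512)        # hypothetical 512-sample feature map
w = np.ones(9) / 9.0                # placeholder kernel (learned in practice)
code = strided_conv1d(frame, w, 2)  # feature map halved: 512 -> 256
```

In the actual architecture a channel-compression layer would then collapse the channel axis so that a single 256-dimensional real-valued code vector remains per frame.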
“…To compress speech signals, a core component of this AE is the trainable quantizer, which learns a discrete representation of the code layer in the AE. Among recent neural-network-compatible quantization schemes, such as VQ-VAE [21] and soft-to-hard quantization [22], we focus on soft-to-hard quantization, namely softmax quantization, as used in other end-to-end speech coding AEs [19, 18]. Given an input frame x ∈ R^S of S samples, the output from the encoder is h = F_Enc(x), each element of which is a 16-bit floating-point value.…”
Section: Soft-to-hard (Softmax) Quantization
confidence: 99%
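The soft-to-hard (softmax) quantization idea can be sketched as follows: each code value is softly assigned to a set of quantization centers via a softmax over negative squared distances, giving a differentiable surrogate for training, while the nearest center is used at test time. The specific centers, the softness parameter alpha, and the use of numpy are assumptions for illustration; in the cited work the centers are learned and the scheme runs inside the training graph.

```python
import numpy as np

def softmax_quantize(h, centers, alpha=50.0):
    # h:       (N,) real-valued code from the encoder
    # centers: (K,) quantization centers (learned in practice, fixed here)
    # alpha:   softness; larger alpha pushes the soft output toward hard
    d = -alpha * (h[:, None] - centers[None, :]) ** 2
    w = np.exp(d - d.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)        # soft assignments (differentiable)
    soft = w @ centers                       # soft output used during training
    hard = centers[np.argmax(w, axis=1)]     # hard output used for the bitstream
    return soft, hard

h = np.array([0.12, -0.7, 0.49])
centers = np.linspace(-1.0, 1.0, 5)          # 5 centers ~ log2(5) bits per dim
soft, hard = softmax_quantize(h, centers)
```

With a large alpha the soft output closely tracks the hard assignment, which is what lets the straight-through-style training converge toward the discrete codes used at inference.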
“…Traditional speech codecs such as CELP [6], Opus [7], and adaptive multirate wideband (AMR-WB) [8] commonly employ hand-engineered encoder-decoder pipelines that rely on manually or mathematically crafted audio representation features and/or signal prediction models. Recent DNN-based approaches, including [9], demonstrate the feasibility of training an end-to-end speech codec whose performance is comparable to the hand-crafted AMR-WB codec at 9-24 kbps. [10] uses deep spiking neural networks to realize a low-bit-rate speech codec.…”
Section: Related Work
confidence: 99%
“…The TF-domain regularizer compensates for an end-to-end DNN that would otherwise operate only in the time domain. Empirically, it is shown to achieve better perceptual quality, as proposed in [20].…”
Section: Objective Function
confidence: 99%
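A time-frequency regularizer of the kind this statement describes can be sketched as a spectral-magnitude penalty added to the time-domain reconstruction error. The STFT parameters (Hann window, n_fft=256, hop=128), the mean-squared formulation, and the weight lam are assumptions for illustration; the cited work's exact objective may differ.

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=128):
    # Magnitude STFT via a Hann-windowed real FFT (minimal sketch).
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def tf_regularized_loss(x, x_hat, lam=0.1):
    # Time-domain reconstruction error plus a TF-domain penalty, so the
    # end-to-end model is not trained on waveform error alone.
    time_loss = np.mean((x - x_hat) ** 2)
    tf_loss = np.mean((stft_mag(x) - stft_mag(x_hat)) ** 2)
    return time_loss + lam * tf_loss

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)                   # clean frame
x_hat = x + 0.01 * rng.standard_normal(1024)    # "decoded" frame
loss = tf_regularized_loss(x, x_hat)
```

The spectral term penalizes errors that are perceptually salient but small in waveform terms (e.g. phase-aligned tonal distortion), which is one plausible reading of why the regularizer improves perceptual quality.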