2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461487
End-To-End Optimized Speech Coding with Deep Neural Networks

Abstract: Modern compression algorithms are often the result of laborious domain-specific research; industry standards such as MP3, JPEG, and AMR-WB took years to develop and were largely hand-designed. We present a deep neural network model which optimizes all the steps of a wideband speech coding pipeline (compression, quantization, entropy coding, and decompression) end-to-end, directly from raw speech data, with no manual feature engineering necessary, and it trains in hours. In testing, our DNN-based coder performs on pa…

Cited by 55 publications (49 citation statements)
References 15 publications
“…A 1D-CNN architecture on the time-domain samples serves as the desired lightweight autoencoder (AE) for end-to-end speech coding, where model complexity is a major concern [19, 18]. As shown in Table 1, the encoder consists of four bottleneck ResNet stages [20], a downsampling convolutional layer that halves the feature-map size in the middle, and then a channel compression layer that creates a real-valued code vector of 256 dimensions.…”
Section: End-to-end Speech Coding Autoencoders
confidence: 99%
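The downsampling step described in this statement can be illustrated with a minimal numpy sketch. The stride-2 convolution below halves the feature-map length, which is how a 512-sample feature map would shrink to the 256 dimensions mentioned above; the 512-sample frame size and the averaging kernel are assumptions for illustration, not details from the cited paper, and a real encoder would use learned multi-channel kernels inside a DL framework.

```python
import numpy as np

def strided_conv1d(x, w, stride=2):
    # Zero-padded single-channel 1-D convolution with the given stride;
    # stride=2 halves the feature-map length, as a downsampling layer does.
    k = len(w)
    xp = np.pad(x, (k // 2, k // 2))
    out = [np.dot(xp[i:i + k], w) for i in range(0, len(x), stride)]
    return np.array(out)

frame = np.random.randn(512)        # hypothetical 512-sample feature map
w = np.ones(9) / 9.0                # placeholder kernel (learned in practice)
code = strided_conv1d(frame, w, 2)  # feature map halved: 512 -> 256
```

In the actual architecture a channel-compression layer would then collapse the channel axis so that a single 256-dimensional real-valued code vector remains per frame.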
“…To compress speech signals, a core component of this AE is the trainable quantizer, which learns a discrete representation of the code layer in the AE. Among recent neural-network-compatible quantization schemes, such as VQ-VAE [21] and soft-to-hard quantization [22], we focus on soft-to-hard quantization, namely softmax quantization, as used in other end-to-end speech coding AEs [19, 18]. Given an input frame x ∈ R^S of S samples, the output from the encoder is h = F_Enc(x), each element of which is a 16-bit floating-point value.…”
Section: Soft-to-hard (Softmax) Quantization
confidence: 99%
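The soft-to-hard (softmax) quantization idea can be sketched as follows: each code value is softly assigned to a set of quantization centers via a softmax over negative squared distances, giving a differentiable surrogate for training, while the nearest center is used at test time. The specific centers, the softness parameter alpha, and the use of numpy are assumptions for illustration; in the cited work the centers are learned and the scheme runs inside the training graph.

```python
import numpy as np

def softmax_quantize(h, centers, alpha=50.0):
    # h:       (N,) real-valued code from the encoder
    # centers: (K,) quantization centers (learned in practice, fixed here)
    # alpha:   softness; larger alpha pushes the soft output toward hard
    d = -alpha * (h[:, None] - centers[None, :]) ** 2
    w = np.exp(d - d.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)        # soft assignments (differentiable)
    soft = w @ centers                       # soft output used during training
    hard = centers[np.argmax(w, axis=1)]     # hard output used for the bitstream
    return soft, hard

h = np.array([0.12, -0.7, 0.49])
centers = np.linspace(-1.0, 1.0, 5)          # 5 centers ~ log2(5) bits per dim
soft, hard = softmax_quantize(h, centers)
```

With a large alpha the soft output closely tracks the hard assignment, which is what lets the straight-through-style training converge toward the discrete codes used at inference.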
“…Traditional speech codecs such as CELP [6], Opus [7], and adaptive multirate wideband (AMR-WB) [8] commonly employ hand-engineered encoder-decoder pipelines that rely on manually or mathematically crafted audio representation features and/or signal prediction models. Recent DNN-based approaches, including [9], demonstrate the feasibility of training an end-to-end speech codec whose performance is comparable to the hand-crafted AMR-WB codec at 9-24 kbps. [10] uses deep spiking neural networks to realize a low-bit-rate speech codec.…”
Section: Related Work
confidence: 99%
“…The TF-domain regularizer compensates for an end-to-end DNN that would otherwise operate only in the time domain. Empirically, it is shown to achieve better perceptual quality, as proposed in [20].…”
Section: Objective Function
confidence: 99%
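A time-frequency regularizer of the kind this statement describes can be sketched as a spectral-magnitude penalty added to the time-domain reconstruction error. The STFT parameters (Hann window, n_fft=256, hop=128), the mean-squared formulation, and the weight lam are assumptions for illustration; the cited work's exact objective may differ.

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=128):
    # Magnitude STFT via a Hann-windowed real FFT (minimal sketch).
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def tf_regularized_loss(x, x_hat, lam=0.1):
    # Time-domain reconstruction error plus a TF-domain penalty, so the
    # end-to-end model is not trained on waveform error alone.
    time_loss = np.mean((x - x_hat) ** 2)
    tf_loss = np.mean((stft_mag(x) - stft_mag(x_hat)) ** 2)
    return time_loss + lam * tf_loss

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)                   # clean frame
x_hat = x + 0.01 * rng.standard_normal(1024)    # "decoded" frame
loss = tf_regularized_loss(x, x_hat)
```

The spectral term penalizes errors that are perceptually salient but small in waveform terms (e.g. phase-aligned tonal distortion), which is one plausible reading of why the regularizer improves perceptual quality.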