2022
DOI: 10.48550/arxiv.2204.00771
Preprint

Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

Abstract: This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining the model quality. Our approach includes two aspects: architecture and knowledge distillation (KD). We propose an end-to-end enhancement (E3Net) model architecture, which is 3× faster than a baseline STFT-based model. Besides, we use KD techniques to develop compressed student models without significantly degrading quality. In addition, we investigate using noisy data without reference c…
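
The paper's implementation is not included on this page. As a rough illustration of the end-to-end (waveform-in, waveform-out) design that the abstract contrasts with an STFT-based baseline, the sketch below uses a learned encoder, a recurrent bottleneck, and a learned decoder in PyTorch. The module names, layer sizes, and masking structure are assumptions for illustration only, and the speaker-embedding conditioning used for personalization is omitted; this is not the published E3Net configuration.

```python
# Minimal sketch of a waveform-in / waveform-out enhancement model with a
# learned encoder, recurrent bottleneck, and learned decoder. Layer sizes and
# module choices are illustrative assumptions, not the published E3Net design.
import torch
import torch.nn as nn


class TinyE2EEnhancer(nn.Module):
    def __init__(self, feat_dim=256, kernel=320, stride=160, hidden=256):
        super().__init__()
        # Learned analysis filterbank in place of an STFT front end.
        self.encoder = nn.Conv1d(1, feat_dim, kernel_size=kernel, stride=stride)
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, feat_dim), nn.Sigmoid())
        # Learned synthesis filterbank mapping features back to a waveform.
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, kernel_size=kernel, stride=stride)

    def forward(self, noisy):                    # noisy: (batch, 1, samples)
        feats = torch.relu(self.encoder(noisy))  # (batch, feat_dim, frames)
        x, _ = self.rnn(feats.transpose(1, 2))   # (batch, frames, hidden)
        masked = feats * self.mask(x).transpose(1, 2)
        return self.decoder(masked)              # (batch, 1, ~samples)


if __name__ == "__main__":
    model = TinyE2EEnhancer()
    wav = torch.randn(2, 1, 16000)               # two 1-second 16 kHz clips
    print(model(wav).shape)                      # torch.Size([2, 1, 16000])
```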

Cited by 3 publications (4 citation statements)
References 23 publications (44 reference statements)

“…The results are presented in Table 12. For the baseline we implemented the method according to [49], and achieved results close to the original research in terms of speech clarity. As the evaluation scores were similar when the model size was set to 25%, we performed a two-tailed t-test between the results of fixed KD, Method C, and non-KD to assess their significance, as shown in Table 13 and Table 14.…”
Section: Comparison of CSTR VCTK Dataset
confidence: 66%
“…Knowledge distillation (KD) [94], also known as teacher–student training, refers to training small DNN models with supervision generated by computationally demanding teacher models. The low-cost E3Net [95] also uses KD to leverage unpaired noisy samples. E3Net outperformed the authors' earlier networks with a threefold reduction in computational cost.…”
Section: Techniques for the Reduction in Computational and Memory Req...
confidence: 99%
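
The distillation setup described in the statement above (a large teacher supervising a compact student, including on unpaired noisy clips that have no clean reference) can be sketched roughly as follows. This is a minimal, assumed PyTorch-style training step, not the exact recipe of the cited work; the `student`, `teacher`, and `optimizer` objects, the L1 waveform loss, and the `alpha` weighting are all placeholders.

```python
# Rough sketch of teacher-student knowledge distillation for speech
# enhancement. The teacher's enhanced output acts as a pseudo-target, which
# also works for unpaired noisy clips that have no clean reference.
import torch
import torch.nn.functional as F


def kd_step(student, teacher, optimizer, noisy, clean=None, alpha=0.5):
    """One assumed distillation step (not the cited work's exact recipe).

    noisy : (batch, 1, samples) noisy input waveforms
    clean : matching clean references, or None for unpaired noisy data
    alpha : weight of the distillation term when a clean reference exists
    """
    teacher.eval()
    with torch.no_grad():
        teacher_out = teacher(noisy)              # pseudo-target from teacher

    student_out = student(noisy)
    kd_loss = F.l1_loss(student_out, teacher_out)
    if clean is not None:
        # Paired data: combine the ground-truth loss with the distillation loss.
        loss = (1 - alpha) * F.l1_loss(student_out, clean) + alpha * kd_loss
    else:
        # Unpaired noisy data: the teacher output is the only supervision.
        loss = kd_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

When clean references are available the student mixes the ground-truth loss with the distillation loss; on unpaired noisy data the teacher's enhanced output is the only supervision, which is what lets KD exploit noisy data without reference signals.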
“…It was reported that tensor decomposition gives better STOI than pruning for the same compression rate. KD can lead to reductions in the number of computations (2-4 times), with slight degradation in quality metrics [95]. It was not extensively tested for speech enhancement.…”
Section: Techniques for the Reduction in Computational and Memory Req...
confidence: 99%
“…Personalization has shown promising results in model compression tasks for speech enhancement [15,16,17,18]. A personalized model adapts to the target speaker group's speech traits, narrowing the training task down to a smaller subtask, i.e., one defined by a smaller speaker group than the full set of speakers in the corpus.…”
Section: Introduction
confidence: 99%