Personalized speech enhancement: new models and comprehensive evaluation

Eskimez, Şefik Emre; Yoshioka, Takuya; Wang, Huaming; Wang, Xiaofei; Chen, Zhuo; Huang, Xuedong

doi:10.1109/icassp43922.2022.9746962

Cited by 45 publications

(55 citation statements)

References 24 publications

“…The model was tested based on speech communication, and the ASR accuracy was not considered. Eskimez et al [4] proposed two PSE models, an evaluation metric called target speaker oversuppression (TSOS), and test sets to cover various scenarios. TSOS measures the degree of removal of the target speaker's speech segments and is critical for PSE since removing the target speech hampers effective conversations and degrades the transcription quality, as reported in [8].…”

Section: Related Work (mentioning)

confidence: 99%
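
The citation statement above describes TSOS only in words: it measures how much of the target speaker's speech is removed. The following is a minimal sketch of that idea, not the definition from the cited paper. The function name, the per-frame magnitude inputs, and the `threshold` value are all assumptions made for illustration.

```python
def over_suppression_rate(ref_frames, est_frames, threshold=0.5, eps=1e-8):
    """Fraction of frames in which target speech is judged over-suppressed.

    ref_frames: per-frame magnitude spectra of the clean target speech.
    est_frames: per-frame magnitude spectra of the enhanced output.
    threshold:  hypothetical cut-off on the removed-energy ratio.
    """
    if len(ref_frames) != len(est_frames):
        raise ValueError("reference and estimate need the same number of frames")
    if not ref_frames:
        return 0.0

    flagged = 0
    for ref, est in zip(ref_frames, est_frames):
        # Count only energy that was removed (reference louder than estimate).
        removed = sum(max(r - e, 0.0) ** 2 for r, e in zip(ref, est))
        ref_energy = sum(r ** 2 for r in ref)
        # Flag the frame when most of the target energy is gone.
        if removed / (ref_energy + eps) > threshold:
            flagged += 1
    return flagged / len(ref_frames)


ref = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
est = [[1.0, 1.0], [0.0, 0.0], [0.9, 0.9], [2.0, 2.0]]
rate = over_suppression_rate(ref, est)  # only the second frame is flagged
```

Clipping the difference at zero means that frames where the output is louder than the reference do not count as suppression, which matches the one-sided nature of the metric as described.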

“…TSOS measures the degree of removal of the target speaker's speech segments and is critical for PSE since removing the target speech hampers effective conversations and degrades the transcription quality, as reported in [8]. Furthermore, Taherian et al [5] extended [4] to multi-channel scenarios by proposing a model that works with any microphone numbers and array geometries. Although the models of [4] can run on PCs in realtime, the computational cost was still too high for real usage as the audio processing can use only a tiny fraction of the available resources on devices.…”

Section: Related Work (mentioning)

confidence: 99%

“…Personalized speech enhancement (PSE) provides an improvement to the general SE approach by using prior knowledge about a target speaker [2,3,4,5]. One exemplary approach to PSE is to extract a speaker embedding vector from a short enrollment audio sample of the target speaker and feed it to an SE model.…”

Section: Introduction (mentioning)

confidence: 99%
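
The approach in the statement above (extract a speaker embedding from an enrollment sample, then feed it to the SE model) can be sketched roughly as follows. This is an illustration under stated assumptions: averaging frame features is only a stand-in for a trained speaker encoder, and concatenation is just one possible way to feed the embedding to a model. Both function names are hypothetical.

```python
def enrollment_embedding(enroll_frames):
    """Stand-in for a speaker encoder: average the enrollment frame features.

    A real system would use a trained speaker-embedding network here.
    """
    n = len(enroll_frames)
    dim = len(enroll_frames[0])
    return [sum(frame[i] for frame in enroll_frames) / n for i in range(dim)]


def condition_on_speaker(noisy_frames, embedding):
    """Append the same speaker embedding to every noisy input frame."""
    return [list(frame) + list(embedding) for frame in noisy_frames]


embedding = enrollment_embedding([[1.0, 3.0], [3.0, 5.0]])
model_input = condition_on_speaker([[0.1], [0.2]], embedding)
```

The embedding is computed once per speaker and reused for every frame, so the enhancement model sees the same speaker cue throughout the utterance.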

“…End-to-end modeling: We propose a personalized end-to-end enhancement network (E3Net), a faster neural network model that is shown to improve the speech quality, WER, and reduce TSOS than a previously proposed personalized deep complex convolution recurrent network (pDCCRN) [4].…”

Section: Introduction (mentioning)

confidence: 99%

“…We examine the effect of applying bigger teacher models to the real noisy data and using the outputs as clean references for student models. Combination with multi-task learning (MTL) using ASR transcriptions [4,6] is also considered.…”

Section: Introduction (mentioning)

confidence: 99%

Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

Thakker, Eskimez, Yoshioka et al. 2022

Preprint

Self Cite

This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining model quality. Our approach has two aspects: architecture and knowledge distillation (KD). We propose an end-to-end enhancement (E3Net) model architecture, which is 3× faster than a baseline STFT-based model. We also use KD techniques to develop compressed student models without significantly degrading quality. In addition, we investigate using noisy data without reference clean signals for training the student models, where we combine KD with multi-task learning (MTL) using an automatic speech recognition (ASR) loss. Our results show that E3Net provides better speech and transcription quality with a lower target speaker over-suppression (TSOS) rate than the baseline model. Furthermore, we show that the KD methods can yield student models that are 2-4× faster than the teacher and provide reasonable quality. Combining KD and MTL improves the ASR and TSOS metrics without degrading the speech quality.
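
The abstract's combination of KD and MTL on noisy data without clean references might be sketched as a single training loss, shown below. This is an assumption-laden illustration rather than the paper's formulation: the mean-squared-error distance, the `asr_weight` value, and the function names are hypothetical, and the ASR loss is taken as an already computed scalar.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kd_mtl_loss(student_out, teacher_out, asr_loss, asr_weight=0.1):
    """Student loss when no clean reference signal is available.

    teacher_out: output of the larger teacher model, used as a pseudo-clean target.
    asr_loss:    scalar loss from an ASR model run on the student output.
    asr_weight:  hypothetical weight balancing the two terms.
    """
    return mse(student_out, teacher_out) + asr_weight * asr_loss


loss = kd_mtl_loss([1.0, 2.0], [1.0, 4.0], asr_loss=2.0, asr_weight=0.5)
```

The distillation term pulls the student toward the teacher's output, and the ASR term adds a transcription-oriented signal that does not depend on a clean reference.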

Single-Channel Speech Enhancement Using Single Dimension Change Accelerated Particle Swarm Optimization for Subspace Partitioning

Ghorpade, Khaparde 2023

Circuits Syst Signal Process

NeuProNet: neural profiling networks for sound classification

Tran, Vu, Nguyen et al. 2024

Neural Comput & Applic

Real-world sound signals exhibit various aspects of grouping and profiling behaviors, such as being recorded from identical sources, having similar environmental settings, or encountering related background noises. In this work, we propose novel neural profiling networks (NeuProNet) capable of learning and extracting high-level unique profile representations from sounds. An end-to-end framework is developed so that any backbone architecture can be plugged in and trained, achieving better performance in any downstream sound classification task. We introduce an in-batch profile grouping mechanism based on profile awareness and attention pooling to produce reliable and robust features with contrastive learning. Furthermore, extensive experiments are conducted on multiple benchmark datasets and tasks to show that neural computing models under the guidance of our framework achieve significant performance gains across all evaluation tasks. In particular, the integration of NeuProNet surpasses recent state-of-the-art (SoTA) approaches on the UrbanSound8K and VocalSound datasets with statistically significant improvements in benchmarking metrics, up to 5.92% in accuracy compared to the previous SoTA method and up to 20.19% compared to baselines. Our work provides a strong foundation for utilizing neural profiling for machine learning tasks.
