One Model to Enhance Them All: Array Geometry Agnostic Multi-Channel Personalized Speech Enhancement

Taherian, Hassan; Eskimez, Şefik Emre; Yoshioka, Takuya; Wang, Huaming; Chen, Zhuo; Huang, Xuedong

doi:10.1109/icassp43922.2022.9747395

Cited by 15 publications

(17 citation statements)

References 21 publications


“…Missing multi-stream data has been used to attain better performance on tasks such as speech enhancement (Taherian et al, 2022). However, speech enhancement is only partially dependent on spatial information, which is harder to recover.…”

Section: Related Work (mentioning)

confidence: 99%

RecNet: Early Attention Guided Feature Recovery

Biswas, Islam. 2023. Preprint.


Uncertainty in sensors results in corrupted input streams and hinders the performance of Deep Neural Networks (DNN), which focus on deducing information from data. However, for sensors with multiple input streams, the relevant information among the streams correlates and hence contains mutual information. This paper utilizes this opportunity to recover the perturbed information due to corrupted input streams. We propose RecNet, which estimates the information entropy at every element of the input feature to the network and interpolates the missing information in the input feature matrix. Finally, using the estimated information entropy and interpolated data, we introduce a novel guided replacement procedure to recover the complete information that is the input to the downstream DNN task. We evaluate the proposed algorithm on a sound event detection and localization application where audio streams from the microphone array are corrupted. We have recovered the performance drop due to the corrupted input stream and reduced the localization error with non-corrupted input streams.


“…TSOS measures the degree of removal of the target speaker's speech segments and is critical for PSE since removing the target speech hampers effective conversations and degrades the transcription quality, as reported in [8]. Furthermore, Taherian et al [5] extended [4] to multi-channel scenarios by proposing a model that works with any microphone numbers and array geometries. Although the models of [4] can run on PCs in realtime, the computational cost was still too high for real usage as the audio processing can use only a tiny fraction of the available resources on devices.…”

Section: Related Work (mentioning)

confidence: 99%

“…Personalized speech enhancement (PSE) provides an improvement to the general SE approach by using prior knowledge about a target speaker [2,3,4,5]. One exemplary approach to PSE is to extract a speaker embedding vector from a short enrollment audio sample of the target speaker and feed it to an SE model.…”

Section: Introduction (mentioning)

confidence: 99%

Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

Thakker, Eskimez, Yoshioka et al. 2022. Preprint.

Self Cite


This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining the model quality. Our approach includes two aspects: architecture and knowledge distillation (KD). We propose an end-to-end enhancement (E3Net) model architecture, which is 3× faster than a baseline STFT-based model. Besides, we use KD techniques to develop compressed student models without significantly degrading quality. In addition, we investigate using noisy data without reference clean signals for training the student models, where we combine KD with multi-task learning (MTL) using an automatic speech recognition (ASR) loss. Our results show that E3Net provides better speech and transcription quality with a lower target speaker over-suppression (TSOS) rate than the baseline model. Furthermore, we show that the KD methods can yield student models that are 2-4× faster than the teacher and provide reasonable quality. Combining KD and MTL improves the ASR and TSOS metrics without degrading the speech quality.


“…Several studies developed causal PSE models utilizing a speaker embedding vector to extract the target speaker's voice [1,2,3,7,8]. Giri et al proposed a perceptually motivated PSE model with low complexity [2].…”

Section: Related Work (mentioning)

confidence: 99%

“…Meanwhile, personalized speech enhancement (PSE) is gaining increased attention from the research community. PSE utilizes additional cues such as a speaker embedding vector of a target speaker to enhance only the speaker's signal even when interfering speech and background noise are both present [1,2,3]. The PSE task may be regarded as a combination of speech separation, enhancement, and speaker verification tasks.…”

Section: Introduction (mentioning)

confidence: 99%

Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation

Taherian, Eskimez, Yoshioka. 2022. Preprint.


Personalized speech enhancement (PSE) models achieve promising results compared with unconditional speech enhancement models due to their ability to remove interfering speech in addition to background noise. Unlike unconditional speech enhancement, causal PSE models may occasionally remove the target speech by mistake. The PSE models also tend to leak interfering speech when the target speaker is silent for an extended period. We show that existing PSE methods suffer from a trade-off between speech oversuppression and interference leakage by addressing one problem at the expense of the other. We propose a new PSE model training framework using cross-task knowledge distillation to mitigate this trade-off. Specifically, we utilize a personalized voice activity detector (pVAD) during training to exclude the non-target speech frames that are wrongly identified as containing the target speaker with hard or soft classification. This prevents the PSE model from being too aggressive while still allowing the model to learn to suppress the input speech when it is likely to be spoken by interfering speakers. Comprehensive evaluation results are presented, covering various PSE usage scenarios.
