SDR – Half-baked or Well Done?

Le Roux, Jonathan; Wisdom, Scott; Erdoğan, Hakan; Hershey, John R.

doi:10.1109/icassp.2019.8683855

Cited by 752 publications

(525 citation statements)

References 29 publications

Mentioning: 484

“…Following the common speech separation metrics [12,21], we adopt average SI-SDR and SDR improvement over mixture as the evaluation metrics. We also report the performances under different ranges of angle difference between speakers to give a more comprehensive assessment for the model.…”

Section: Results Analysis (mentioning)

confidence: 99%

Enhancing End-to-End Multi-Channel Speech Separation Via Spatial Feature Learning

Zhang, Chen, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Hand-crafted spatial features (e.g., inter-channel phase difference, IPD) play a fundamental role in recent deep learning based multi-channel speech separation (MCSS) methods. However, these manually designed spatial features are hard to incorporate into the end-to-end optimized MCSS framework. In this work, we propose an integrated architecture for learning spatial features directly from the multi-channel speech waveforms within an end-to-end speech separation framework. In this architecture, time-domain filters spanning signal channels are trained to perform adaptive spatial filtering. These filters are implemented by a 2d convolution (conv2d) layer and their parameters are optimized using a speech separation objective function in a purely data-driven fashion. Furthermore, inspired by the IPD formulation, we design a conv2d kernel to compute the inter-channel convolution differences (ICDs), which are expected to provide the spatial cues that help to distinguish the directional sources. Evaluation results on simulated multi-channel reverberant WSJ0 2-mix dataset demonstrate that our proposed ICD based MCSS model improves the overall signal-to-distortion ratio by 10.4% over the IPD based MCSS model.

“…Finally, the decoder reconstructs the separated speech waveform from the masked mixture encode for each speaker. To optimize the network end-to-end, scale-invariant signal-to-distortion ratio (SI-SDR) [12] is utilized as the training objective:…”

Section: Multi-channel Speech Separation (mentioning)

confidence: 99%
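The SI-SDR objective named in the statement above can be sketched as follows. This is a minimal NumPy illustration of the definition in "SDR – Half-baked or Well Done?" (rescale the reference by the optimal factor, then compare it to the residual); it is not the citing paper's training code. The function names `si_sdr` and `si_sdr_loss` and the `eps` stabilizer are assumptions made for this example, and the inputs are assumed to be 1-D signals of equal length.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D reference."""
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Optimal scaling of the reference: alpha = <estimate, target> / ||target||^2
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    scaled_target = alpha * target
    # Everything in the estimate that the scaled reference does not explain
    residual = estimate - scaled_target
    ratio = (np.dot(scaled_target, scaled_target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_loss(estimate, target):
    """Negative SI-SDR, so that minimizing the loss maximizes SI-SDR."""
    return -si_sdr(estimate, target)
```

Because the reference is rescaled before the comparison, multiplying the estimate by any nonzero constant leaves the value essentially unchanged. Some implementations also remove the mean of both signals first; that step is omitted here.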

“…Time-domain loss (TDL): For time-domain networks, we employ the classic signal-to-noise ratio (SNR) [23] as time-domain loss,…”

Section: Training Losses (mentioning)

confidence: 99%

“…where θ are the model parameters, $\mathrm{SNR} = -10 \log_{10}\left( \|x\|^2 / \|x - \hat{x}\|^2 \right)$ is the SNR between the clean speech and the enhanced speech, and $\|\cdot\|_2$ is the $L_2$ norm. We decided to use the classic SNR loss [23] instead of the scale-invariant SNR (SiSNR) used in the original Tas-Net [17], because training the network with SiSNR let the network freely change the level of the enhanced signal. With the SNR loss, the scale of the signals is preserved avoiding any scaling requirement when passing the signal to ASR.…”

Section: Training Losses (mentioning)

confidence: 99%

Improving Noise Robust Automatic Speech Recognition with Single-Channel Time-Domain Enhancement Network

Kinoshita, Ochiai, Delcroix, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

With the advent of deep learning, research on noise-robust automatic speech recognition (ASR) has progressed rapidly. However, ASR performance in noisy conditions of single-channel systems remains unsatisfactory. Indeed, most single-channel speech enhancement (SE) methods (denoising) have brought only limited performance gains over state-of-the-art ASR back-end trained on multicondition training data. Recently, there has been much research on neural network-based SE methods working in the time-domain showing levels of performance never attained before. However, it has not been established whether the high enhancement performance achieved by such time-domain approaches could be translated into ASR. In this paper, we show that a single-channel time-domain denoising approach can significantly improve ASR performance, providing more than 30 % relative word error reduction over a strong ASR back-end on the real evaluation data of the single-channel track of the CHiME-4 dataset. These positive results demonstrate that single-channel noise reduction can still improve ASR performance, which should open the door to more research in that direction.
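The scale argument in the Training Losses statement quoted above can be illustrated with a small sketch. It assumes the classic SNR is the ratio of clean-signal energy to error energy in dB, with the loss being its negative; the names `snr_db` and `snr_loss` and the `eps` stabilizer are hypothetical and are not taken from the cited paper.

```python
import numpy as np


def snr_db(estimate, target, eps=1e-8):
    """Classic SNR in dB: clean-signal energy over error energy."""
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # The error is measured against the unscaled reference,
    # so a level mismatch in the estimate counts as distortion.
    error = target - estimate
    ratio = (np.dot(target, target) + eps) / (np.dot(error, error) + eps)
    return 10.0 * np.log10(ratio)


def snr_loss(estimate, target):
    """Negative SNR, to be minimized during training."""
    return -snr_db(estimate, target)
```

With this definition, an estimate that equals the clean signal at half the level scores only 10 log10(4), about 6.02 dB, whereas a scale-invariant measure would treat it as a perfect reconstruction. That penalty is what keeps the output level tied to the reference.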

“…The three composite measures CSIG, CBAK, and COVL are the popular predictor of the mean opinion score (MOS) of the target signal distortion, background noise interference, and overall speech quality, respectively [30]. In addition, as the standard metric in speech enhancement, we also evaluated scale-invariant SDR (SI-SDR) [31]. Table 1 shows the experimental results.…”

Section: Open (mentioning)

confidence: 99%

Speech Enhancement Using Self-Adaptation and Multi-Head Self-Attention

Koizumi, Yatabe, Delcroix, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract a speaker representation used for adaptation directly from the test utterance. Conventional studies of deep neural network (DNN)-based speech enhancement mainly focus on building a speaker independent model. Meanwhile, in speech applications including speech recognition and synthesis, it is known that model adaptation to the target speaker improves the accuracy. Our research question is whether a DNN for speech enhancement can be adopted to unknown speakers without any auxiliary guidance signal in test-phase. To achieve this, we adopt multi-task learning of speech enhancement and speaker identification, and use the output of the final hidden layer of speaker identification branch as an auxiliary feature. In addition, we use multi-head self-attention for capturing long-term dependencies in the speech and noise. Experimental results on a public dataset show that our strategy achieves the state-of-the-art performance and also outperform conventional methods in terms of subjective quality.
