A Regression Approach to Speech Enhancement Based on Deep Neural Networks

Source separation is the task of separating an audio recording into individual sound sources, and it is fundamental to computational auditory scene analysis. Previous work on source separation has focused on separating particular sound classes such as speech and music. Much of this previous work requires pairs of mixtures and clean sources for training. In this work, we propose a source separation framework trained with weakly labelled data. Weakly labelled data contains only the tags of an audio clip, without the occurrence times of sound events. We first train a sound event detection system with AudioSet. The trained sound event detection system is used to detect segments that are most likely to contain a target sound event. Then a regression is learnt from a mixture of two randomly selected segments to a target segment, conditioned on the audio tagging prediction of the target segment. Our proposed system can separate 527 kinds of sound classes from AudioSet within a single system. A U-Net is adopted for the separation system and achieves an average SDR of 5.67 dB over 527 sound classes in AudioSet.

Index Terms: source separation, weakly labelled data, computational auditory scene analysis, AudioSet.
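The SDR figure quoted in the abstract above can be read through the common energy-ratio definition, SDR = 10 log10(||s||^2 / ||s - s_hat||^2). The sketch below is an illustration only: the helper names `mix` and `sdr_db` are hypothetical, the signals are toy lists rather than AudioSet segments, and the paper may use a different SDR variant (for example the BSS Eval definition).

```python
import math


def mix(segment_a, segment_b):
    """Additively mix two equal-length segments (assumed mixing model)."""
    if len(segment_a) != len(segment_b):
        raise ValueError("segments must have the same length")
    return [a + b for a, b in zip(segment_a, segment_b)]


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB, using the plain energy-ratio form."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    if signal_energy == 0:
        raise ValueError("reference is silent; SDR is undefined")
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_energy / error_energy)


if __name__ == "__main__":
    target = [1.0, 0.0, -1.0, 0.0]
    interference = [0.2, -0.1, 0.3, 0.1]
    mixture = mix(target, interference)
    # SDR of the unprocessed mixture against the target: a baseline that a
    # separation system is expected to improve on.
    print(f"mixture SDR: {sdr_db(target, mixture):.2f} dB")
```

Comparing the SDR of the separated output with the SDR of the raw mixture gives the improvement attributable to the separation system.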

“…For music source separation, s_k can be vocal or accompanies. In this work, we build f in the time-frequency (T-F) domain [6,11].…”

Section: Regression Based Source Separation (mentioning)

confidence: 99%

Source Separation with Weakly Labelled Data: an Approach to Computational Auditory Scene Analysis

Kong, Wang, Song, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

“…Xu et al [6] first proposed to use the mapping based approach to reduce the effect of background noise. Specifically, they trained a deep neural network (DNN) to learn a nonlinear mapping from the noisy speech spectral magnitude to that of the clean speech.…”

Section: DNN Based Mapping (mentioning)

confidence: 99%

37‐2: Invited Paper: Enhancing Speech in Noisy and Reverberant Environments Using Deep Learning Techniques

Tao, Bhowmik 2018

Symp Digest of Tech Papers

Sound signals play a crucial role in immersive perceptual experiences, such as virtual and augmented reality applications and hearing assistant devices. Traditional approaches enhance speech by estimating the background noise or the speech based on its statistics or a parametric model. However, the performance of such approaches has plateaued due to mismatches between their assumptions and the actual background noise and speech. Recently, deep learning (DL) has been applied to this challenging problem, taking advantage of its ability to learn a nonlinear mapping and to recognize patterns without making explicit assumptions about the background noise or speech. In this paper, we provide a systematic review of single-microphone DL-based speech enhancement approaches. Through an analysis of their advantages and disadvantages, we provide some insight into future research directions for speech enhancement for hearing devices.

Author Keywords: Supervised speech enhancement; denoising; dereverberation; ideal ratio mask; speech intelligibility; long short-term memory; recurrent neural networks; deep neural networks.

“…The performance of speech enhancement using T-F masking is affected by both T-F mask estimator and T-F transform. The recent advance of T-F mask estimator is brought by DNN-based T-F mask estimation methods [1][2][3][4][5][6][7][8][9][10][11]. While DNN-based T-F masking is ordinarily applied in short-time Fourier transform (STFT) domain, some methods designed a specific T-F transform for assisting T-F mask estimation and investigated optimal T-F domain for speech enhancement [12,13].…”

Section: Introduction (mentioning)

confidence: 99%
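The T-F masking mentioned in the statement above can be illustrated with the ideal ratio mask, sqrt(S^2 / (S^2 + N^2)) per T-F bin, applied to a noisy magnitude spectrogram. This is a minimal sketch under stated assumptions: the mask is computed from known (oracle) speech and noise magnitudes instead of being estimated by a DNN, the spectrograms are tiny nested lists rather than STFT output, and the function names `ideal_ratio_mask` and `apply_mask` are hypothetical.

```python
import math


def ideal_ratio_mask(speech_mag, noise_mag):
    """Per-bin ideal ratio mask from speech and noise magnitudes.

    Both inputs are lists of frames, each frame a list of frequency-bin
    magnitudes. Bins with no energy at all get a mask value of 0.
    """
    mask = []
    for speech_frame, noise_frame in zip(speech_mag, noise_mag):
        frame = []
        for s, n in zip(speech_frame, noise_frame):
            total = s * s + n * n
            frame.append(math.sqrt(s * s / total) if total > 0 else 0.0)
        mask.append(frame)
    return mask


def apply_mask(mask, noisy_mag):
    """Element-wise product of a mask and a noisy magnitude spectrogram."""
    return [
        [m * x for m, x in zip(mask_frame, noisy_frame)]
        for mask_frame, noisy_frame in zip(mask, noisy_mag)
    ]


if __name__ == "__main__":
    speech = [[3.0, 0.0], [1.0, 2.0]]
    noise = [[4.0, 1.0], [1.0, 0.0]]
    # Assumed noisy magnitudes; in practice these come from the STFT of the
    # recorded mixture.
    noisy = [[5.0, 1.0], [1.4, 2.0]]
    enhanced = apply_mask(ideal_ratio_mask(speech, noise), noisy)
    print(enhanced)
```

In a DNN-based system the mask would be predicted from the noisy input alone, and the masked magnitude would be combined with a phase estimate before the inverse transform.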

Invertible DNN-Based Nonlinear Time-Frequency Transform for Speech Enhancement

Takeuchi, Yatabe, Oikawa, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We propose an end-to-end speech enhancement method with a trainable time-frequency (T-F) transform based on an invertible deep neural network (DNN). The recent development of speech enhancement has been driven by the use of DNNs. Ordinary DNN-based speech enhancement employs a T-F transform, typically the short-time Fourier transform (STFT), and estimates a T-F mask using a DNN. On the other hand, some methods have considered end-to-end networks which directly estimate the enhanced signals without a T-F transform. While end-to-end methods have shown promising results, they are black boxes and hard to understand. Therefore, some end-to-end methods used a DNN to learn a linear T-F transform, which is much easier to understand. However, the learned transform may not have a

A Regression Approach to Speech Enhancement Based on Deep Neural Networks

Cited by 1,128 publications

References 38 publications
