Fullsubnet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement

Hao, Xiang; Su, Xiangdong; Horaud, Radu; Li, Xiaofei

doi:10.1109/icassp39728.2021.9414177

Cited by 125 publications

(57 citation statements)

References 22 publications

“…In addition, the inter-channel magnitude/intensity difference plays an especially important role for binaural localization, as the intensity difference of binaural signals can reflect the torso/head shadow effect of signal propagation. In order to promote the localization performance, the recently proposed FullSubNet [16] is adopted to predict the complex ideal ratio mask and enhance the complex speech spectrograms. Accounting for the following DP-RTF learning, the clean direct-path sound is taken as the target signal, which means both noise reduction and dereverberation are conducted.…”

Section: Monaural Enhancement (mentioning)

confidence: 99%
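
The complex ideal ratio mask mentioned in the statement above can be illustrated per time-frequency bin. What follows is a minimal sketch, not the FullSubNet implementation: the function names `complex_ideal_ratio_mask` and `apply_mask` are hypothetical, the spectrogram bins are passed as plain lists of Python complex numbers, and the `eps` stabiliser is an assumption. Training pipelines often compress or bound the mask; that step is omitted here.

```python
def complex_ideal_ratio_mask(clean, noisy, eps=1e-8):
    """Return a per-bin complex mask M such that M * noisy ~= clean.

    clean, noisy: equal-length sequences of complex STFT bins (illustrative layout).
    eps: small constant (an assumption) that avoids division by zero.
    """
    mask = []
    for s, y in zip(clean, noisy):
        denom = y.real ** 2 + y.imag ** 2 + eps
        # Real and imaginary parts of S / Y, written out explicitly.
        real = (y.real * s.real + y.imag * s.imag) / denom
        imag = (y.real * s.imag - y.imag * s.real) / denom
        mask.append(complex(real, imag))
    return mask


def apply_mask(mask, noisy):
    """Enhance the noisy bins by complex multiplication with the mask."""
    return [m * y for m, y in zip(mask, noisy)]


# Toy values, chosen only to exercise the functions.
clean_bins = [1 + 2j, -0.5 + 0.25j]
noisy_bins = [2 - 1j, 1 + 1j]
ideal_mask = complex_ideal_ratio_mask(clean_bins, noisy_bins)
enhanced_bins = apply_mask(ideal_mask, noisy_bins)
```

Because the mask is complex, multiplying it with the noisy bin corrects both magnitude and phase, which is why the citing paper can use it to recover complex spectrograms rather than magnitudes alone.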

“…The enhanced speech would be definitely helpful for DP-RTF estimation. In this work, we adopt the network architecture of the monaural speech enhancement method in [16]. This enhancement method is modified to recover the clean direct-path magnitude and phase spectrograms from the contaminated ones, instead of recovering the noise-free signals.…”

Section: Introduction (mentioning)

confidence: 99%

Learning Deep Direct-Path Relative Transfer Function for Binaural Sound Source Localization

Yang, Liu. 2022. Preprint. Self cite.

Direct-path relative transfer function (DP-RTF) refers to the ratio between the direct-path acoustic transfer functions of two microphone channels. Though DP-RTF fully encodes the sound spatial cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes to learn DP-RTF with deep neural networks for robust binaural sound source localization. A DP-RTF learning network is designed to regress the binaural sensor signals to a real-valued representation of DP-RTF. It consists of a branched convolutional neural network module to separately extract the inter-channel magnitude and phase patterns, and a convolutional recurrent neural network module for joint feature learning. To better explore the speech spectra to aid the DP-RTF estimation, a monaural speech enhancement network is used to recover the direct-path spectrograms from the noisy ones. The enhanced spectrograms are stacked onto the noisy spectrograms to act as the input of the DP-RTF learning network. We train one unique DP-RTF learning network using many different binaural arrays to enable the generalization of DP-RTF learning across arrays. This way avoids time-consuming training data collection and network retraining for a new array, which is very useful in practical application. Experimental results on both simulated and real-world data show the effectiveness of the proposed method for direction of arrival (DOA) estimation in the noisy and reverberant environment, and a good generalization ability to unseen binaural arrays.

“…As the authors did not use a studio with sound isolation, the dataset contains some environment noise. For our experiments we resample the audios to 16Khz and use the FullSubNet model [34] as denoiser. For development we randomly selected 500 samples and the rest of the dataset was used for training.…”

Section: Audio Datasets (mentioning)

confidence: 99%

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

Casanova, Weber, Shulby, et al. 2021. Preprint.

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training.

“…We utilized the small part in our experiments, specifically, 9h worth (2477 utterances) for training and the remaining 1h (286 utterances) for inference and evaluation. This division between the training and evaluation sets, i.e., into 9h and 1h sets, is given in [24] and its code 3 .…”

Section: Libri-light [24] (mentioning)

confidence: 99%

“…For instance, convolutional neural networks (CNNs) [1] have been shown to be better than using a short-time Fourier transform (STFT) and inverse STFT (ISTFT) for building an encoder and decoder [2]. Furthermore, methods that utilize recurrent neural networks (RNNs)-based models have been shown to be capable of real-time processing [3][4][5]. In addition, there are hybrid methods that exploit the benefits of both types of network, i.e., real-time processing and high performance [6,7].…”

Section: Introduction (mentioning)

confidence: 99%

Improving Character Error Rate Is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-box Acoustic Models

Sawata, Kashiwagi, Takahashi. 2021. Preprint.

A deep neural network (DNN)-based speech enhancement (SE) aiming to maximize the performance of an automatic speech recognition (ASR) system is proposed in this paper. In order to optimize the DNN-based SE model in terms of the character error rate (CER), which is one of the metric to evaluate the ASR system and generally non-differentiable, our method uses two DNNs: one for speech processing and one for mimicking the output CERs derived through an acoustic model (AM). Then both of DNNs are alternately optimized in the training phase. Even if the AM is a black-box, e.g., like one provided by a third-party, the proposed method enables the DNN-based SE model to be optimized in terms of the CER since the DNN mimicking the AM is differentiable. Consequently, it becomes feasible to build CER-centric SE model that has no negative effect, e.g., additional calculation cost and changing network architecture, on the inference phase since our method is merely a training scheme for the existing DNN-based methods. Experimental results show that our method improved CER by 7.3% relative derived through a black-box AM although certain noise levels are kept.
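
The character error rate referred to in this abstract is conventionally the character-level edit distance between the recognised text and the reference, divided by the reference length. The sketch below assumes that conventional definition; the function names `edit_distance` and `cer` are hypothetical and are not taken from the cited paper. The `min` over discrete edit operations also illustrates why the metric is non-differentiable.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # delete a reference character
                            curr[j - 1] + 1,       # insert a hypothesis character
                            prev[j - 1] + cost))   # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("kitten", "sitting")` evaluates to 0.5, since three edits are needed against a six-character reference.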
