Masking and Inpainting: A Two-Stage Speech Enhancement Approach for Low SNR and Non-Stationary Noise

Hao, Xiang; Su, Xiangdong; Wen, Shixue; Wang, Zhiyu; Pan, Yi-Qian; Bao, Feilong; Chen, Wei

doi:10.1109/icassp40776.2020.9053188

Cited by 35 publications (15 citation statements)

References 10 publications

“…The reason might be that the data augmentations and joint performing super-resolution can increase the generalization and inpainting ability of the model (Hao et al, 2020). The PESQ score of VF-UNet reaches 2.43, higher than SEGAN, WaveUNet, and the model trained with weakly labeled data in Kong et al (2021b).…”

Section: Super-resolution (mentioning)

confidence: 98%

VoiceFixer: Toward General Speech Restoration with Neural Vocoder

Liu, Kong, Tian, et al. 2021. Preprint.

Speech restoration aims to remove distortions in speech signals. Prior methods mainly focus on single-task speech restoration (SSR), such as speech denoising or speech declipping. However, SSR systems only focus on one task and do not address the general speech restoration problem. In addition, previous SSR systems show limited performance in some speech restoration tasks such as speech super-resolution. To overcome those limitations, we propose a general speech restoration (GSR) task that attempts to remove multiple distortions simultaneously. Furthermore, we propose VoiceFixer, a generative framework to address the GSR task. VoiceFixer consists of an analysis stage and a synthesis stage to mimic the speech analysis and comprehension of the human auditory system. We employ a ResUNet to model the analysis stage and a neural vocoder to model the synthesis stage. We evaluate VoiceFixer with additive noise, room reverberation, low-resolution, and clipping distortions. Our baseline GSR model achieves a 0.499 higher mean opinion score (MOS) than the speech denoising SSR model. VoiceFixer further surpasses the GSR baseline model on MOS score by 0.256. Moreover, we observe that VoiceFixer generalizes well to severely degraded real speech recordings, indicating its potential in restoring old movies and historical speeches. The source code is available at https://github.com/haoheliu/voicefixer_main.

“…They evaluated their systems on long gaps (about 500 ms), while in our work we aim at inpainting also extremely long segments (until 1600 ms), where additional information, like video, is essential to correctly restore speech signals. A very recent work proposed a two-stage enhancement network where binary masking of a noisy speech spectrogram was followed by inpainting of time-frequency bins affected by severe noise [16].…”

Section: Introduction (mentioning)

confidence: 99%

Audio-Visual Speech Inpainting with Deep Learning

Morrone, Michelsanti. 2021. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.

“…Recently, multi-stage learning has been successfully applied for a wide variety of tasks, including human pose estimation [28], action segmentation [29], speech enhancement [30]-[32] and speech separation [33]. A multi-stage architecture consists of stages that sequentially use the same model or a combination of different models, and each model operates directly on the output of the previous stage.…”

mentioning

confidence: 99%

“…Multi-stage learning systems where each stage performs a different task are considered in [30], [31], [33]. Here, each stage has a different task and a different target.…”

mentioning

confidence: 99%

“…The performance can be improved by aggregating different stages if the nature of each stage is complementary. For instance, a two-stage speech enhancement approach is presented in [30], where the first stage uses a model to predict a binary mask to remove frequency bins that are dominated by severe noise, and where the second stage performs in-painting of the masked spectrogram from the first stage to recover the speech spectrogram that was removed in the first stage. In [31], a two-stage algorithm is proposed to optimize the magnitude and phase separately.…”

mentioning

confidence: 99%
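
The two-stage scheme described in the statement above (binary masking of noise-dominated bins, then inpainting of the removed regions) can be illustrated with a minimal NumPy sketch. It is not the method of Hao et al.: both of their stages are learned models, whereas here the mask is an oracle computed from known clean and noise magnitudes, and the inpainting is plain linear interpolation along time. The function names, the 0 dB threshold, and the magnitude-spectrogram inputs are all assumptions made for illustration.

```python
import numpy as np

def binary_mask(clean_mag, noise_mag, threshold_db=0.0):
    """Oracle stand-in for stage 1: 1 where local SNR exceeds the threshold, else 0.

    Assumption: clean and noise magnitudes are known. A real system would
    predict this mask from the noisy spectrogram with a trained model.
    """
    eps = 1e-12
    snr_db = 20.0 * np.log10((clean_mag + eps) / (noise_mag + eps))
    return (snr_db > threshold_db).astype(np.float64)

def inpaint_linear(masked_mag, mask):
    """Stand-in for stage 2: fill removed bins by interpolating along time.

    Each frequency row is processed independently. Rows with no kept bins
    are left as they are (all zeros after masking).
    """
    out = masked_mag.copy()
    frames = np.arange(masked_mag.shape[1])
    for f in range(masked_mag.shape[0]):
        kept = mask[f] > 0
        if kept.any():
            out[f] = np.interp(frames, frames[kept], masked_mag[f, kept])
    return out

def two_stage_enhance(noisy_mag, mask):
    """Apply the mask (stage 1), then inpaint the removed bins (stage 2)."""
    masked = noisy_mag * mask
    return inpaint_linear(masked, mask)
```

The split mirrors the idea in the quoted statement: bins too corrupted to be trusted are discarded outright, and a second step reconstructs them from the surrounding context rather than trying to denoise them directly.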

Speech Enhancement Using Multi-Stage Self-Attentive Temporal Convolutional Networks

Lin, Wijngaarden, Wang, et al. 2021. Preprint.

Multi-stage learning is an effective technique to invoke multiple deep-learning modules sequentially. This paper applies multi-stage learning to speech enhancement by using a multi-stage structure, where each stage comprises a self-attention (SA) block followed by stacks of temporal convolutional network (TCN) blocks with doubling dilation factors. Each stage generates a prediction that is refined in a subsequent stage. A fusion block is inserted at the input of later stages to reinject original information. The resulting multi-stage speech enhancement system, in short, multi-stage SA-TCN, is compared with state-of-the-art deep-learning speech enhancement methods using the LibriSpeech and VCTK data sets. The multi-stage SA-TCN system's hyper-parameters are fine-tuned, and the impact of the SA block, the fusion block and the number of stages are determined. The use of a multi-stage SA-TCN system as a frontend for automatic speech recognition systems is investigated as well. It is shown that the multi-stage SA-TCN systems perform well relative to other state-of-the-art systems in terms of speech enhancement and speech recognition scores.
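
To make "doubling dilation factors" concrete, the short sketch below computes how far back in time one stack of dilated convolutions can see. It assumes one dilated 1-D convolution per block, with dilations 1, 2, 4, and so on; the kernel size and block count are hypothetical values, not taken from the paper, and the helper name is made up for illustration.

```python
def tcn_receptive_field(kernel_size, num_blocks):
    """Receptive field (in frames) of a stack of dilated 1-D convolutions.

    Assumption: one convolution per block, with dilation doubling each
    block (1, 2, 4, ...). Each layer widens the field by
    (kernel_size - 1) * dilation frames.
    """
    dilations = [2 ** i for i in range(num_blocks)]
    return 1 + (kernel_size - 1) * sum(dilations)

# Hypothetical configuration: kernel size 3, eight blocks.
print(tcn_receptive_field(3, 8))  # 511
```

Because the dilations sum to 2^L - 1 for L blocks, the receptive field grows exponentially with depth while the parameter count grows only linearly, which is the usual motivation for this layout.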
