ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413580

Towards Efficient Models for Real-Time Deep Noise Suppression

Abstract: With recent research advancements, deep learning models are becoming attractive and powerful choices for speech enhancement in real-time applications. While state-of-the-art models can achieve outstanding results in terms of speech quality and background noise reduction, the main challenge is to obtain compact enough models that are resource-efficient at inference time. An important but often neglected aspect of data-driven methods is that results can only be convincing when tested on real-world data an…

Cited by 70 publications (52 citation statements). References 22 publications (31 reference statements).

Citation statements:
“…Training Data for SE: We utilized a large-scale and high-quality simulated dataset described in [24], which includes around 1,000 hours of paired speech samples. As a clean speech corpus, the dataset collects 544 hours of speech recordings with high mean opinion score (MOS) values from the LibriVox corpus [25].…”
Section: Datasets (citation type: mentioning; confidence: 99%)
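
As context for this excerpt, the following is a minimal NumPy sketch of how a paired noisy/clean training sample is typically simulated for speech enhancement: a clean utterance is mixed with a noise recording at a sampled SNR. The SNR range, the signal placeholders, and the function name mix_at_snr are illustrative assumptions, not the exact recipe of the dataset in [24].

import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the mixture reaches the requested SNR, then add it to the clean speech."""
    # Loop or trim the noise to the length of the clean utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Gain that places the noise at the desired level relative to the speech.
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise

# Illustrative usage: draw a random SNR and build one (noisy, clean) pair.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for a clean LibriVox utterance
noise = rng.standard_normal(48000)   # stand-in for a noise recording
noisy = mix_at_snr(clean, noise, snr_db=rng.uniform(0.0, 40.0))
training_pair = (noisy, clean)       # network input / target for supervised SE training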
“…In addition, the clean speech in each mixture is convolved with an acoustic room impulse response (RIR) sampled from 7,000 measured and simulated responses. See [24] for details of this dataset. The data are available publicly, except for the 65 hours of the internal noise recordings.…”
Section: Datasets (citation type: mentioning; confidence: 99%)
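
The reverberation step mentioned in this excerpt amounts to convolving the dry clean speech with a sampled room impulse response. A minimal sketch using scipy.signal.fftconvolve follows; the toy exponentially decaying RIR and the length/level handling are illustrative assumptions, not the measured or simulated responses used in [24].

import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve dry speech with a room impulse response, keeping the original length and level."""
    reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]
    # Rescale so the reverberant signal keeps roughly the energy of the dry speech.
    scale = np.sqrt(np.mean(clean ** 2) / (np.mean(reverberant ** 2) + 1e-12))
    return reverberant * scale

# Illustrative usage: pick one RIR from a pool (random data standing in for the
# 7,000 measured and simulated responses) and apply it before noise is added.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
rir = np.exp(-np.linspace(0.0, 8.0, 4000)) * rng.standard_normal(4000)  # toy decaying RIR
reverberant_clean = reverberate(clean, rir)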
“…1. We use the convolutional recurrent network for speech enhancement (CRUSE) proposed in [19], which…”
Section: Speech PSD Estimation (citation type: mentioning; confidence: 99%)
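
CRUSE is introduced in [19] as a convolutional recurrent network for speech enhancement. The sketch below shows a generic convolutional-encoder / GRU / deconvolutional-decoder mask estimator in that spirit; the channel counts, kernel sizes, 256-bin input, and single GRU layer are illustrative assumptions and not the published CRUSE configuration.

import torch
import torch.nn as nn

class ConvRecurrentEnhancer(nn.Module):
    """Toy conv-recurrent mask estimator: conv encoder -> GRU bottleneck -> deconv decoder."""

    def __init__(self, n_freq: int = 256):
        super().__init__()
        # Encoder: convolutions strided along the frequency axis only.
        self.enc1 = nn.Conv2d(1, 16, kernel_size=(4, 1), stride=(2, 1), padding=(1, 0))
        self.enc2 = nn.Conv2d(16, 32, kernel_size=(4, 1), stride=(2, 1), padding=(1, 0))
        # Recurrent bottleneck over the time axis (channels and frequency flattened).
        bottleneck = 32 * (n_freq // 4)
        self.gru = nn.GRU(bottleneck, bottleneck, batch_first=True)
        # Decoder: transposed convolutions mirroring the encoder.
        self.dec1 = nn.ConvTranspose2d(32, 16, kernel_size=(4, 1), stride=(2, 1), padding=(1, 0))
        self.dec2 = nn.ConvTranspose2d(16, 1, kernel_size=(4, 1), stride=(2, 1), padding=(1, 0))

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (batch, 1, freq, time) magnitude spectrogram.
        x = torch.relu(self.enc1(noisy_mag))
        x = torch.relu(self.enc2(x))
        b, c, f, t = x.shape
        h, _ = self.gru(x.permute(0, 3, 1, 2).reshape(b, t, c * f))
        x = torch.relu(self.dec1(h.reshape(b, t, c, f).permute(0, 2, 3, 1)))
        mask = torch.sigmoid(self.dec2(x))
        return mask * noisy_mag  # enhanced magnitude estimate

# Illustrative usage on a random 256-bin, 100-frame spectrogram.
net = ConvRecurrentEnhancer(n_freq=256)
enhanced = net(torch.randn(1, 1, 256, 100).abs())  # -> torch.Size([1, 1, 256, 100])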
“…4. As baselines we have the unprocessed reference microphone, delay&sum and superdirective MVDR beamformers, the single-channel DNN (CRUSE) [19] applied on the reference mic, mask-based MVDR beamformer using the DNN-mask to adaptively update the noise covariance [21], and the competitive RLS-WPD [15] as the state-of-the-art online convolutional beamformer.…”
Section: Evaluation Setup (citation type: mentioning; confidence: 99%)
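
For reference alongside this list of baselines, a minimal NumPy sketch of the textbook MVDR weights and a mask-weighted noise covariance estimate (for one frequency bin) is given below; the steering vector, mask, and diagonal loading are illustrative assumptions, not the implementations evaluated in [21] or [15]. A delay-and-sum beamformer would simply use w = d / n_mics instead of the MVDR solution.

import numpy as np

def mvdr_weights(steering: np.ndarray, noise_cov: np.ndarray) -> np.ndarray:
    """MVDR weights for one bin: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)."""
    phi_inv_d = np.linalg.solve(noise_cov, steering)
    return phi_inv_d / (steering.conj() @ phi_inv_d)

def mask_based_noise_cov(stft_frames: np.ndarray, noise_mask: np.ndarray) -> np.ndarray:
    """Mask-weighted spatial covariance for one bin.

    stft_frames: (n_frames, n_mics) complex STFT of all microphones at this frequency.
    noise_mask:  (n_frames,) DNN noise-presence mask in [0, 1].
    """
    weighted = stft_frames * noise_mask[:, None]
    return (weighted.T @ stft_frames.conj()) / (noise_mask.sum() + 1e-8)

# Illustrative usage for a single frequency bin with 4 microphones.
rng = np.random.default_rng(0)
n_mics, n_frames = 4, 200
y = rng.standard_normal((n_frames, n_mics)) + 1j * rng.standard_normal((n_frames, n_mics))
mask = rng.uniform(size=n_frames)                    # stand-in for a DNN noise mask
d = np.exp(-2j * np.pi * rng.uniform(size=n_mics))   # stand-in steering vector
phi_n = mask_based_noise_cov(y, mask) + 1e-6 * np.eye(n_mics)  # diagonal loading for stability
w = mvdr_weights(d, phi_n)
enhanced_bin = y @ w.conj()                          # beamformer output over time for this bin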