DNN-based Distributed Multichannel Mask Estimation for Speech Enhancement in Microphone Arrays

Furnon, Nicolas; Serizel, Romain; Illina, Irina; Essid, Slim

doi:10.1109/icassp40776.2020.9054643

Cited by 8 publications

(12 citation statements)

References 24 publications

“…Similarly to DNN-BF, several existing works integrate the time-frequency masks and the multichannel beamformer. These works either employ a single-channel DNN model [40]-[42], which exploit the spectral information only, or employ a multichannel DNN model [43]-[45], which exploit both the spectral and spatial information of the microphone signals. These various types of DNN models can also be used in the proposed method.…”

Section: Discussion (mentioning)

confidence: 99%

“…Multi-channel approaches typically use time-frequency masks estimated by the DNN model to construct a spatial filter to enhance the target sound [40]-[45]. Extensions of this idea [46], [47] estimate the coefficients of the filter directly from the multi-channel data, which however require a large amount of training data simulated in a variety of scenarios.…”

Section: Introduction (mentioning)

confidence: 99%
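
The idea in the citation statement above, using DNN-estimated time-frequency masks to build a spatial filter, can be illustrated with a minimal NumPy sketch. This is a generic mask-based multichannel Wiener filter, not the implementation of any paper listed on this page. The function names (`masked_covariance`, `mask_based_mwf`), the array layout (channels, frequencies, frames), the reference-channel choice and the regularisation constant `eps` are all assumptions made for illustration, and the mask is taken as given rather than predicted by a network.

```python
import numpy as np


def masked_covariance(stft, mask):
    """Mask-weighted spatial covariance per frequency bin.

    stft: complex array, shape (channels, freqs, frames)  [assumed layout]
    mask: real array in [0, 1], shape (freqs, frames)
    Returns an array of shape (freqs, channels, channels).
    """
    weighted = stft * mask[None, :, :]
    # Sum over frames of mask * y y^H for each frequency bin.
    cov = np.einsum("cft,dft->fcd", weighted, stft.conj())
    # Normalise by the total mask weight; guard against an all-zero mask.
    norm = np.maximum(mask.sum(axis=-1), 1e-8)
    return cov / norm[:, None, None]


def mask_based_mwf(stft, speech_mask, ref=0, eps=1e-6):
    """Hypothetical mask-driven multichannel Wiener filter.

    The speech and noise covariances are estimated from the mask and its
    complement; the filter for each bin is w = (Phi_s + Phi_n)^-1 Phi_s e_ref.
    Returns the filtered STFT of shape (freqs, frames).
    """
    n_ch = stft.shape[0]
    phi_s = masked_covariance(stft, speech_mask)
    phi_n = masked_covariance(stft, 1.0 - speech_mask)
    # Small diagonal loading (assumed value) keeps the system well conditioned.
    phi_y = phi_s + phi_n + eps * np.eye(n_ch)
    # Right-hand side is the ref-th column of Phi_s, kept as a column matrix.
    w = np.linalg.solve(phi_y, phi_s[:, :, ref:ref + 1])[..., 0]
    # Apply w^H y in every time-frequency bin.
    return np.einsum("fc,cft->ft", w.conj(), stft)
```

In a mask-based pipeline of this kind, `speech_mask` would come from the DNN, while the filter itself remains a classical closed-form expression computed independently in each frequency bin.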

Deep Learning Assisted Time-Frequency Processing for Speech Enhancement on Drones

Wang

Cavallaro

2021

IEEE Trans. Emerg. Top. Comput. Intell.

This article fills the gap between the growing interest in signal processing based on Deep Neural Networks (DNN) and the new application of enhancing speech captured by microphones on a drone. In this context, the quality of the target sound is degraded significantly by the strong ego-noise from the rotating motors and propellers. We present the first work that integrates single-channel and multi-channel DNN-based approaches for speech enhancement on drones. We employ a DNN to estimate the ideal ratio masks at individual time-frequency bins, which are subsequently used to design three potential speech enhancement systems, namely single-channel ego-noise reduction (DNN-S), multi-channel beamforming (DNN-BF), and multi-channel time-frequency spatial filtering (DNN-TF). The main novelty lies in the proposed DNN-TF algorithm, which infers the noise-dominance probabilities at individual time-frequency bins from the DNN-estimated soft masks, and then incorporates them into a time-frequency spatial filtering framework for ego-noise reduction. By jointly exploiting the direction of arrival of the target sound, the time-frequency sparsity of the acoustic signals (speech and ego-noise) and the time-frequency noise-dominance probability, DNN-TF can suppress the ego-noise effectively in scenarios with very low signal-to-noise ratios (e.g. SNR lower than −15 dB), especially when the direction of the target sound is close to that of a source of the ego-noise. Experiments with real and simulated data show the advantage of DNN-TF over competing methods, including DNN-S, DNN-BF and the state-of-the-art time-frequency spatial filtering.

“…In previous work [18], we replaced the oracle VAD used in DANSE by a TF mask predicted by a convolutional recurrent neural network (CRNN), in a similar manner as [7], [8]. We showed that the compressed signals sent to compute the filter of Equation (2) could also help to improve the mask prediction at the second step by a multi-node DNN.…”

Section: DNN-based Distributed Multichannel Wiener Filter (mentioning)

confidence: 99%

Attention-based distributed speech enhancement for unconstrained microphone arrays with varying number of nodes

Furnon

Serizel

Essid

et al. 2021

Preprint

Self Cite

Speech enhancement promises higher efficiency in ad-hoc microphone arrays than in constrained microphone arrays thanks to the wide spatial coverage of the devices in the acoustic scene. However, speech enhancement in ad-hoc microphone arrays still raises many challenges. In particular, the algorithms should be able to handle a variable number of microphones, as some devices in the array might appear or disappear. In this paper, we propose a solution that can efficiently process the spatial information captured by the different devices of the microphone array, while being robust to a link failure. To do this, we use an attention mechanism in order to put more weight on the relevant signals sent throughout the array and to neglect the redundant or empty channels.

“…One way to reduce the computational cost of the DNN-based methodologies while exploiting spatial information is to use ad-hoc microphone arrays and to distribute the processing over all the devices of the array. In a previous article, we introduced a solution that proved to efficiently process multichannel data in a distributed microphone array in the context of speech enhancement [14]. This approach was based on a two-step version of the distributed adaptive node-specific signal estimation (DANSE) algorithm by Bertrand and Moonen, where so-called compressed signals are sent among the devices [15].…”

Section: Introduction (mentioning)

confidence: 99%

Distributed Speech Separation in Spatially Unconstrained Microphone Arrays

Furnon

Serizel

Illina

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Speech separation with several speakers is a challenging task because of the non-stationarity of the speech and the strong signal similarity between interferent sources. Current state-of-the-art solutions can separate well the different sources using sophisticated deep neural networks which are very tedious to train. When several microphones are available, spatial information can be exploited to design much simpler algorithms to discriminate speakers. We propose a distributed algorithm that can process spatial information in a spatially unconstrained microphone array. The algorithm relies on a convolutional recurrent neural network that can exploit the signal diversity from the distributed nodes. In a typical case of a meeting room, this algorithm can capture an estimate of each source in a first step and propagate it over the microphone array in order to increase the separation performance in a second step. We show that this approach performs even better when the number of sources and nodes increases. We also study the influence of a mismatch in the number of sources between the training and testing conditions.
