2020 28th European Signal Processing Conference (EUSIPCO), 2021
DOI: 10.23919/eusipco47968.2020.9287478
Dilated U-net based approach for multichannel speech enhancement from First-Order Ambisonics recordings

Abstract: We present a CNN architecture for speech enhancement from multichannel first-order Ambisonics mixtures. The data-dependent spatial filters, deduced from a mask-based approach, are used to help an automatic speech recognition engine to face adverse conditions of reverberation and competitive speakers. The mask predictions are provided by a neural network, fed with rough estimations of speech and noise amplitude spectra, under the assumption of known directions of arrival. This study evaluates the replacing of t…
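The abstract describes deriving data-dependent spatial filters from network-predicted masks. As a rough illustration only (the paper's exact filter formulation is not given here), the sketch below shows one common way such masks are used: mask-weighted spatial covariance estimates followed by an MVDR solution. The function name `mvdr_from_mask`, the 4-channel FOA STFT layout, and the MVDR choice are all assumptions for the example.

```python
import numpy as np

def mvdr_from_mask(Y, speech_mask, ref_ch=0, eps=1e-6):
    """Illustrative mask-based spatial filtering (not the paper's exact method).

    Y           : (C, F, T) complex STFT of the FOA mixture (C = 4 channels).
    speech_mask : (F, T) real-valued mask in [0, 1] predicted by a network.
    Returns the beamformed (F, T) spectrogram.
    """
    C, F, T = Y.shape
    noise_mask = 1.0 - speech_mask
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[:, f, :]                                   # (C, T)
        # Mask-weighted spatial covariance matrices of speech and noise
        Rs = (speech_mask[f] * Yf) @ Yf.conj().T / (speech_mask[f].sum() + eps)
        Rn = (noise_mask[f] * Yf) @ Yf.conj().T / (noise_mask[f].sum() + eps)
        Rn += eps * np.eye(C)                             # regularize before inversion
        # MVDR weights: w = Rn^{-1} Rs e_ref / trace(Rn^{-1} Rs)
        num = np.linalg.solve(Rn, Rs)
        w = num[:, ref_ch] / (np.trace(num) + eps)
        out[f] = w.conj() @ Yf
    return out
```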

Cited by 17 publications (8 citation statements)
References 20 publications (23 reference statements)
“…For Task 1 (SE), we use a Filter-and-Sum Network architecture (FaSNet) [5], adapted from a public PyTorch implementation. This network is a state-of-the-art neural beamformer that operates in the time domain and therefore works on both the magnitude and the phase information of the signal.…”
Section: Baseline Methods (mentioning)
confidence: 99%
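FaSNet itself is a more elaborate two-stage model; purely as an illustration of the time-domain filter-and-sum idea the quote refers to, the PyTorch sketch below filters each channel with its own FIR filter and sums the results. The function name `filter_and_sum` and the assumption that the filters come pre-predicted by some network are placeholders, not part of the FaSNet code.

```python
import torch
import torch.nn.functional as F

def filter_and_sum(x, filters):
    """Generic time-domain filter-and-sum (illustration, not the full FaSNet model).

    x       : (batch, channels, time) multichannel waveform.
    filters : (batch, channels, taps) per-channel FIR filters, e.g. predicted
              by a neural network from the input frames.
    Returns the (batch, time) beamformed waveform.
    """
    B, C, T = x.shape
    taps = filters.shape[-1]
    x_pad = F.pad(x, (taps - 1, 0))                  # causal padding on the time axis
    # Convolve each channel with its own filter (grouped conv), then sum channels.
    y = F.conv1d(
        x_pad.reshape(1, B * C, -1),
        filters.reshape(B * C, 1, taps),
        groups=B * C,
    ).reshape(B, C, T)
    return y.sum(dim=1)
```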
“…Neural beamforming techniques such as Filter-and-Sum Networks (FaSNet) [5] provide state-of-the-art results for Ambisonics-based SE and are usually suitable for low-latency scenarios. U-Net-based approaches also provide competitive results in this context, both for monaural [6,7] and multichannel SE tasks [8], at the expense of a higher computational power demand. Other techniques to perform SE include recurrent neural networks (RNNs) [9], graph-based spectral subtraction [10], discriminative learning [11], and dilated convolutions [12,13].…”
Section: Introduction (mentioning)
confidence: 99%
“…The U-Net architecture [16], originally proposed for biomedical signal processing, has recently been used for audio source separation tasks [14,17,18]. More importantly, it has also been adapted for speech enhancement [11] with the addition of dilated convolutions. It has been shown that the generation of filter masks with a dilated U-Net performed better than without dilation, indicating the usefulness of dilation in such a network.…”
Section: U-Net Architecture (mentioning)
confidence: 99%
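As a small illustration of the dilated convolutions discussed in this quote, the PyTorch sketch below stacks 2-D convolutions whose dilation rate doubles at each layer, so the receptive field grows exponentially with depth while the feature-map size stays fixed; this is the kind of block that could sit inside a U-Net encoder. The class name, channel count, and depth are assumptions, not the paper's architecture.

```python
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Stack of 2-D convolutions with growing dilation (illustrative sketch only)."""

    def __init__(self, channels=64, num_layers=3):
        super().__init__()
        layers = []
        for i in range(num_layers):
            d = 2 ** i                       # dilation 1, 2, 4, ...
            layers += [
                # padding = dilation keeps the (freq, time) size for a 3x3 kernel
                nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, channels, freq, time)
        return self.net(x)
```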
“…Deep networks predominantly use higher-dimensional log-power spectra with a comparably long temporal context in an attempt to learn features that best represent clean speech [9,10]. Recent works focus on using DNNs to generate ideal filter masks based on these self-learned speech representations [11,12], where each mask corresponds to one target speaker. Psycho-acoustic modeling is combined with a speech quality optimization target in [13].…”
Section: Introduction (mentioning)
confidence: 99%
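To make the mask-based formulation in this quote concrete, here is a minimal NumPy sketch of (a) log-power spectral features stacked with temporal context as DNN input and (b) applying one predicted mask per target speaker to the mixture STFT. The function names and the context width are illustrative assumptions.

```python
import numpy as np

def logpower_context_features(stft_frames, context=5, eps=1e-10):
    """Log-power spectra with symmetric temporal context (illustrative sketch).

    stft_frames : (F, T) complex STFT of one channel.
    context     : frames stacked on each side, providing the long temporal context.
    Returns an array of shape (T, F * (2*context + 1)).
    """
    logpow = np.log(np.abs(stft_frames) ** 2 + eps)           # (F, T)
    F, T = logpow.shape
    padded = np.pad(logpow, ((0, 0), (context, context)), mode="edge")
    return np.stack(
        [padded[:, t:t + 2 * context + 1].ravel() for t in range(T)], axis=0
    )

def apply_speaker_masks(mixture_stft, masks):
    """Apply DNN-predicted masks, one per target speaker, to the mixture STFT.

    mixture_stft : (F, T) complex STFT of the noisy mixture.
    masks        : (num_speakers, F, T) real-valued masks in [0, 1].
    Returns one enhanced (F, T) spectrogram per speaker.
    """
    return [m * mixture_stft for m in masks]
```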