Interspeech 2019
DOI: 10.21437/interspeech.2019-1428

Deep Learning Based Multi-Channel Speaker Recognition in Noisy and Reverberant Environments

Abstract: Despite successful applications of multi-channel signal processing in robust automatic speech recognition (ASR), relatively little research has been conducted on the effectiveness of such techniques in the robust speaker recognition domain. This paper introduces time-frequency (T-F) masking-based beamforming to address text-independent speaker recognition in conditions where strong diffuse noise and reverberation are both present. We examine various masking-based beamformers, such as parameterized multi-channel…

Cited by 26 publications (13 citation statements) · References 28 publications
“…In the line of research on masking-based beamforming, earlier efforts [8], [10]-[14] use a DNN only on spectral features to compute a mask for each microphone; the estimated masks at the different microphones are then pooled together to identify T-F units dominated by the same source across all microphones for covariance matrix computation. Subsequent studies incorporate spatial features such as inter-channel phase differences (IPD) [15], [16], cosine and sine IPD, target-direction-compensated IPD [17], beamforming results [18], [19], and stacked phases and magnitudes [20], [21] as a way of leveraging spatial information to further improve mask estimation for beamforming.…”
Section: Introduction
confidence: 99%
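To make the mask-pooling and covariance step in the excerpt above concrete, here is a minimal NumPy sketch, assuming per-microphone masks (e.g., from a DNN) have already been estimated and median-pooled; variable names and the Souden-style MVDR formulation are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def masked_covariance(stft, mask):
    """Mask-weighted spatial covariance. stft: (M, F, T), mask: (F, T)."""
    # Weight each T-F unit by the pooled mask, then average outer products over time.
    weighted = mask[None] * stft                                  # (M, F, T)
    phi = np.einsum('aft,bft->fab', weighted, stft.conj())        # (F, M, M)
    return phi / np.maximum(mask.sum(-1), 1e-8)[:, None, None]

def mvdr_weights(phi_s, phi_n, ref=0):
    """Souden-style MVDR: one weight vector per frequency bin."""
    F, M, _ = phi_s.shape
    w = np.zeros((F, M), dtype=complex)
    for f in range(F):
        num = np.linalg.solve(phi_n[f], phi_s[f])   # Phi_n^{-1} Phi_s
        w[f] = num[:, ref] / np.trace(num)          # normalize by the trace
    return w

# Usage sketch: pool per-mic masks, estimate covariances, beamform the STFT X.
# pooled = np.median(mask_per_mic, axis=0)                  # (F, T)
# phi_s = masked_covariance(X, pooled)
# phi_n = masked_covariance(X, 1.0 - pooled)
# Y = np.einsum('fm,mft->ft', mvdr_weights(phi_s, phi_n).conj(), X)
```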
“…Then, the parameters of the ResNet layer and the SAP layer were fixed and sent to the multi-channel ASV. Finally, we trained the STB blocks of the proposed STB-ASV with the Libri-adhoc-simu and Libri-adhoc40 data respectively, where the number of spatio-temporal blocks is 2 and the number of attention heads … We used voxceleb trainer to build our models. The preprocessing of the data and the training setting of the proposed model are the same as in [13].…”
Section: Methods
confidence: 99%
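The SAP layer referenced in this excerpt is commonly a self-attentive pooling layer that aggregates frame-level ResNet features into a single utterance-level embedding. A minimal PyTorch sketch of the standard formulation follows; the class name and dimensions are assumptions, not the cited paper's code.

```python
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Pool frame-level features (B, T, D) into one embedding (B, D)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)              # hidden projection
        self.query = nn.Linear(dim, 1, bias=False)   # scalar score per frame

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        scores = self.query(torch.tanh(self.proj(h)))   # (B, T, 1)
        alpha = torch.softmax(scores, dim=1)            # attention over time
        return (alpha * h).sum(dim=1)                   # attention-weighted mean

# x = torch.randn(8, 200, 256)          # 8 utterances, 200 frames, 256-dim features
# emb = SelfAttentivePooling(256)(x)    # -> (8, 256) utterance embeddings
```

The softmax over time lets the network emphasize speaker-discriminative frames and down-weight noisy or silent ones, which is why SAP typically outperforms plain temporal averaging.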
“…Due to noise, reverberation, and speech signal attenuation, the performance of single-channel ASV drops sharply and still faces challenges in far-field environments. In order to make these smart devices robust to noisy and reverberant environments, one approach is to equip them with multiple microphones so that the spectral and spatial diversity of the target and interference signals can be leveraged using beamforming approaches [1][2][3]. It has been demonstrated in [4][5][6][7] that single-channel and multi-channel speech enhancement lead to substantial improvements in ASV.…”
Section: Introduction
confidence: 99%
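As a baseline illustration of how beamforming exploits the spatial diversity this excerpt mentions, a delay-and-sum beamformer simply phase-aligns the channels toward an assumed target direction and averages them. A minimal frequency-domain sketch for a linear array follows; the array geometry, DOA convention, and function name are illustrative assumptions.

```python
import numpy as np

def delay_and_sum(X, mic_pos, doa_deg, fs, n_fft, c=343.0):
    """X: multichannel STFT (M, F, T); mic_pos: mic x-coordinates in meters."""
    M, F, _ = X.shape
    freqs = np.arange(F) * fs / n_fft                  # bin center frequencies (Hz)
    # Far-field time delay of arrival per microphone for a source at doa_deg.
    tau = mic_pos * np.cos(np.deg2rad(doa_deg)) / c    # (M,)
    steering = np.exp(-2j * np.pi * freqs[None, :] * tau[:, None])  # (M, F)
    # Compensate each channel's delay (conjugate phase), then average channels.
    return np.mean(steering.conj()[:, :, None] * X, axis=0)         # (F, T)
```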
“…In the best case, a 30% improvement in EER has been reported. In [12], several masking-based beamformers are used for denoising and dereverberation. The rank-1 MVDR beamformer gave the best results with real RIRs, and the GEV-BAN beamformer gave the best results with simulated RIRs.…”
Section: Related Work
confidence: 99%
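For context on the beamformers named in this excerpt: the GEV beamformer maximizes the output SNR per frequency bin, and BAN (blind analytic normalization, after Warsitz & Haeb-Umbach) is the usual postfilter that corrects its arbitrary per-frequency gain. A minimal sketch, assuming speech and noise spatial covariance matrices are already available (e.g., from the masked-covariance step sketched earlier):

```python
import numpy as np
from scipy.linalg import eigh

def gev_ban(phi_s, phi_n):
    """GEV beamformer with BAN postfilter; phi_*: (F, M, M) Hermitian."""
    F, M, _ = phi_s.shape
    w = np.zeros((F, M), dtype=complex)
    for f in range(F):
        # Principal generalized eigenvector maximizes w^H phi_s w / w^H phi_n w.
        _, vecs = eigh(phi_s[f], phi_n[f])
        v = vecs[:, -1]                      # eigh returns eigenvalues ascending
        # Blind analytic normalization: fix the arbitrary per-frequency scaling.
        num = np.sqrt(np.real(v.conj() @ phi_n[f] @ phi_n[f] @ v) / M)
        den = np.real(v.conj() @ phi_n[f] @ v)
        w[f] = (num / np.maximum(den, 1e-10)) * v
    return w

# Y = np.einsum('fm,mft->ft', gev_ban(phi_s, phi_n).conj(), X)  # beamformed STFT
```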