2020
DOI: 10.1049/iet-spr.2019.0304
Block‐online multi‐channel speech enhancement using deep neural network‐supported relative transfer function estimates

Abstract: This paper addresses the problem of block-online processing for multi-channel speech enhancement. We consider several variants of a system that performs beamforming supported by DNN-based Voice Activity Detection followed by postfiltering. The speaker is targeted through estimating relative transfer functions between microphones. Each block of the input signals is processed independently in order to make the method applicable in highly dynamic environments. The performance loss caused by the short length of th…
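To make the block-online idea concrete, the following is a minimal Python sketch of the processing loop the abstract describes; `vad_net`, `estimate_rtf`, `beamform`, and `postfilter` are hypothetical placeholders for the paper's components, not the authors' code.

```python
def enhance_block_online(stft_blocks, vad_net, estimate_rtf, beamform, postfilter):
    """Minimal block-online loop: every block of multi-channel STFT frames is
    processed independently, so the system can follow a moving speaker.
    All callables are hypothetical placeholders for the components named in
    the abstract (DNN-based VAD, RTF estimator, beamformer, postfilter)."""
    for block in stft_blocks:                      # block: (mics, frames, freqs), complex
        speech_prob = vad_net(block)               # DNN-based voice activity estimate
        rtf = estimate_rtf(block, speech_prob)     # per-frequency relative transfer functions
        beamformed = beamform(block, rtf)          # RTF-steered beamformer output
        yield postfilter(beamformed, speech_prob)  # single-channel postfiltering
```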


Cited by 13 publications (11 citation statements). References 50 publications.
“…To avoid learning RTF components common to all RTFs in the dataset, their mean is subtracted, the VAE is trained on the residual, and the mean RTF is added back to the output of the VAE for reconstruction. The network is trained by minimizing (9) with γ = 0.95 using ADAM [31] with an initial learning rate of 10⁻³, which is reduced by a factor of five if the validation loss did not improve by at least 10⁻³ within the last five epochs, so that training does not get stuck. To avoid overfitting, early stopping is employed: the network parameters of the epoch with the lowest validation loss are restored if the validation loss did not improve by at least 10⁻³ within the last ten epochs.…”
Section: Methods (mentioning)
confidence: 99%
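The schedule quoted above maps directly onto standard tooling. Below is a minimal PyTorch sketch under the assumption that `model`, `loss_fn`, and the data loaders are defined elsewhere; `max_epochs` is an arbitrary choice, as the citation does not state it.

```python
import copy
import torch

def train_vae(model, loss_fn, train_loader, val_loader, max_epochs=200):
    """Training loop mirroring the cited schedule: ADAM at 1e-3, learning rate
    divided by five when the validation loss stalls by less than 1e-3 for
    5 epochs, early stopping with best-epoch restoration after 10."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
        opt, mode="min", factor=0.2, patience=5,
        threshold=1e-3, threshold_mode="abs")
    best_loss, best_state, stale = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model, b).item() for b in val_loader) / len(val_loader)
        sched.step(val)                # LR /= 5 if no 1e-3 improvement in 5 epochs
        if val < best_loss - 1e-3:     # "improve by at least 10^-3"
            best_loss, best_state, stale = val, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= 10:            # early stopping, patience of 10 epochs
                break
    model.load_state_dict(best_state)  # restore parameters of the best epoch
    return model
```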
“…To address this problem, RTF estimators that rely on knowledge of the statistical properties of additive noise [6] or employ specialized noise estimators [7] have been proposed. Deep learning-based methods have also been applied to select useful frequency bins for RTF estimation [8,9]. When multiple point sources, e.g., speakers, are present, source separation methods such as directionally constrained Blind Source Separation (BSS) [10,11] or simplex analysis [12] have been applied for RTF estimation.…”
Section: Introduction and Signal Model (mentioning)
confidence: 99%
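For context, the RTF these estimators target is conventionally defined per frequency bin as the ratio of the acoustic transfer functions from the source to each microphone and to a reference microphone (a textbook definition, not a quotation from the cited works):

$$ h_m(f) = \frac{A_m(f)}{A_{\mathrm{ref}}(f)}, \qquad m = 1, \dots, M, $$

where $A_m(f)$ is the acoustic transfer function from the target source to microphone $m$. Noise in individual time-frequency bins corrupts this ratio directly, which is what the noise-aware and bin-selection estimators above aim to mitigate.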
“…Many related studies have investigated online/low-latency processing for mask-based beamformers, e.g., [7], [11], [14], [15]. Most of these studies focused on the online computation of the beamformer coefficients given the masks.…”
Section: Related Work (mentioning)
confidence: 99%
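The batch quantity that these online variants compute incrementally is the mask-weighted spatial covariance matrix (SCM); a standard formulation (again textbook, not quoted from the cited studies) is

$$ \hat{\boldsymbol{\Phi}}_{\nu}(f) = \frac{\sum_{t} m_{\nu}(t,f)\, \mathbf{y}(t,f)\, \mathbf{y}^{\mathsf{H}}(t,f)}{\sum_{t} m_{\nu}(t,f)}, $$

where $\mathbf{y}(t,f)$ is the stacked multi-channel STFT vector and $m_{\nu}(t,f)\in[0,1]$ is the estimated mask for class $\nu$ (speech or noise). Online processing replaces the sums over all frames with running or per-block accumulations.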
“…We can estimate the time-varying spatial covariance matrices (SCMs) using online or blockwise processing. For example, the online mask-based beamformers [7], [11], [14], [15] sequentially update the SCMs. These approaches estimate one SCM and the resultant beamforming filters for each frame or block, rather than for the entire utterance, and thus they could potentially deal with moving sources.…”
Section: Introduction (mentioning)
confidence: 99%
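A minimal numpy sketch of such a sequential update, together with the MVDR filter it would feed, is given below; the forgetting factor `alpha` and the tolerance constants are assumptions, not values from the cited papers.

```python
import numpy as np

def update_scm(scm_prev, Y, mask, alpha=0.9):
    """Recursive SCM update for one frequency bin.

    scm_prev: (M, M) previous SCM estimate; Y: (M, T) block of STFT frames;
    mask: (T,) per-frame speech (or noise) presence weights in [0, 1].
    alpha is a hypothetical forgetting factor."""
    w = mask / (mask.sum() + 1e-8)           # normalized frame weights
    scm_block = (w * Y) @ Y.conj().T         # mask-weighted outer products
    return alpha * scm_prev + (1.0 - alpha) * scm_block

def mvdr_filter(scm_noise, steering):
    """MVDR beamformer w = (Phi_n^{-1} d) / (d^H Phi_n^{-1} d) for one bin."""
    num = np.linalg.solve(scm_noise, steering)
    return num / (steering.conj() @ num)
```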
“…The recording device is a tablet with multiple microphones, which is held by the speaker. Since some recordings involve microphone failures, the method from [48] is used to detect them. If failures are detected, the malfunctioning channels are excluded from further processing of the given recording.…”
Section: Speech Enhancement/Recognition on CHiME-4 Datasets (mentioning)
confidence: 99%
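As an illustration only, a simple correlation-based screen of the kind often used on CHiME-4 recordings is sketched below. It is a generic heuristic and not necessarily the method of [48]; the `threshold` value is an arbitrary assumption.

```python
import numpy as np

def flag_failed_channels(x, threshold=0.3):
    """Hypothetical channel-failure screen: flag channels whose maximum
    envelope correlation with every other channel falls below `threshold`.
    x: (channels, samples) time-domain multi-channel recording."""
    env = np.abs(x)                                  # coarse amplitude envelopes
    env = env - env.mean(axis=1, keepdims=True)      # zero-mean per channel
    norm = np.linalg.norm(env, axis=1, keepdims=True) + 1e-8
    corr = (env / norm) @ (env / norm).T             # channel-pair correlations
    np.fill_diagonal(corr, -1.0)                     # ignore self-correlation
    return corr.max(axis=1) < threshold              # True = likely failed channel
```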