2020
DOI: 10.1049/iet-spr.2019.0304
Block‐online multi‐channel speech enhancement using deep neural network‐supported relative transfer function estimates

Abstract: This paper addresses the problem of block-online processing for multi-channel speech enhancement. We consider several variants of a system that performs beamforming supported by DNN-based Voice Activity Detection followed by postfiltering. The speaker is targeted through estimating relative transfer functions between microphones. Each block of the input signals is processed independently in order to make the method applicable in highly dynamic environments. The performance loss caused by the short length of th…
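To make the block-online idea concrete, the following is a minimal Python sketch of the processing loop the abstract describes; `vad_net`, `estimate_rtf`, `beamform`, and `postfilter` are hypothetical placeholders for the paper's components, not the authors' code.

```python
def enhance_block_online(stft_blocks, vad_net, estimate_rtf, beamform, postfilter):
    """Minimal block-online loop: every block of multi-channel STFT frames is
    processed independently, so the system can follow a moving speaker.
    All callables are hypothetical placeholders for the components named in
    the abstract (DNN-based VAD, RTF estimator, beamformer, postfilter)."""
    for block in stft_blocks:                      # block: (mics, frames, freqs), complex
        speech_prob = vad_net(block)               # DNN-based voice activity estimate
        rtf = estimate_rtf(block, speech_prob)     # per-frequency relative transfer functions
        beamformed = beamform(block, rtf)          # RTF-steered beamformer output
        yield postfilter(beamformed, speech_prob)  # single-channel postfiltering
```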


Cited by 13 publications (11 citation statements). References 50 publications.
“…To avoid learning RTF components common to all RTFs in the dataset, their mean is subtracted, the VAE is trained on the residual, and the mean RTF is added back to the output of the VAE for reconstruction. The network is trained by minimizing (9) with γ = 0.95 using ADAM [31] with an initial learning rate of 10⁻³, which is reduced by a factor of five if the validation loss did not improve by at least 10⁻³ within the last five epochs, so that training does not get stuck. To avoid overfitting, early stopping is employed: the network parameters of the epoch with the lowest validation loss are restored if the validation loss did not improve by at least 10⁻³ within the last ten epochs.…”
Section: Methods (mentioning)
confidence: 99%
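The schedule quoted above maps directly onto standard tooling. Below is a minimal PyTorch sketch under the assumption that `model`, `loss_fn`, and the data loaders are defined elsewhere; `max_epochs` is an arbitrary choice, as the citation does not state it.

```python
import copy
import torch

def train_vae(model, loss_fn, train_loader, val_loader, max_epochs=200):
    """Training loop mirroring the cited schedule: ADAM at 1e-3, learning rate
    divided by five when the validation loss stalls by less than 1e-3 for
    5 epochs, early stopping with best-epoch restoration after 10."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
        opt, mode="min", factor=0.2, patience=5,
        threshold=1e-3, threshold_mode="abs")
    best_loss, best_state, stale = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model, b).item() for b in val_loader) / len(val_loader)
        sched.step(val)                # LR /= 5 if no 1e-3 improvement in 5 epochs
        if val < best_loss - 1e-3:     # "improve by at least 10^-3"
            best_loss, best_state, stale = val, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= 10:            # early stopping, patience of 10 epochs
                break
    model.load_state_dict(best_state)  # restore parameters of the best epoch
    return model
```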
“…To address this problem, RTF estimators that rely on knowledge of the statistical properties of additive noise [6] or employ specialized noise estimators [7] have been proposed. Deep learning-based methods have also been applied to select useful frequency bins for RTF estimation [8,9]. When multiple point sources, e.g., speakers, are present, source separation methods such as directionally constrained Blind Source Separation (BSS) [10,11] or simplex analysis [12] have been applied for RTF estimation.…”
Section: Introduction and Signal Model (mentioning)
confidence: 99%
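For context, the RTF these estimators target is conventionally defined per frequency bin as the ratio of the acoustic transfer functions from the source to each microphone and to a reference microphone (a textbook definition, not a quotation from the cited works):

$$ h_m(f) = \frac{A_m(f)}{A_{\mathrm{ref}}(f)}, \qquad m = 1, \dots, M, $$

where $A_m(f)$ is the acoustic transfer function from the target source to microphone $m$. Noise in individual time-frequency bins corrupts this ratio directly, which is what the noise-aware and bin-selection estimators above aim to mitigate.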
“…Many related studies have investigated online/low-latency processing for mask-based beamformers, e.g., [7], [11], [14], [15]. Most of these studies focused on the online computation of the beamformer coefficients given the masks.…”
Section: Related Work (mentioning)
confidence: 99%
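The batch quantity that these online variants compute incrementally is the mask-weighted spatial covariance matrix (SCM); a standard formulation (again textbook, not quoted from the cited studies) is

$$ \hat{\boldsymbol{\Phi}}_{\nu}(f) = \frac{\sum_{t} m_{\nu}(t,f)\, \mathbf{y}(t,f)\, \mathbf{y}^{\mathsf{H}}(t,f)}{\sum_{t} m_{\nu}(t,f)}, $$

where $\mathbf{y}(t,f)$ is the stacked multi-channel STFT vector and $m_{\nu}(t,f)\in[0,1]$ is the estimated mask for class $\nu$ (speech or noise). Online processing replaces the sums over all frames with running or per-block accumulations.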
“…We can estimate the time-varying spatial covariance matrices (SCMs) using online or blockwise processing. For example, the online mask-based beamformers [7], [11], [14], [15] sequentially update the SCMs. These approaches estimate one SCM and the resultant beamforming filters for each frame or block, rather than for the entire utterance, and thus they could potentially deal with moving sources.…”
Section: Introduction (mentioning)
confidence: 99%
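A minimal numpy sketch of such a sequential update, together with the MVDR filter it would feed, is given below; the forgetting factor `alpha` and the tolerance constants are assumptions, not values from the cited papers.

```python
import numpy as np

def update_scm(scm_prev, Y, mask, alpha=0.9):
    """Recursive SCM update for one frequency bin.

    scm_prev: (M, M) previous SCM estimate; Y: (M, T) block of STFT frames;
    mask: (T,) per-frame speech (or noise) presence weights in [0, 1].
    alpha is a hypothetical forgetting factor."""
    w = mask / (mask.sum() + 1e-8)           # normalized frame weights
    scm_block = (w * Y) @ Y.conj().T         # mask-weighted outer products
    return alpha * scm_prev + (1.0 - alpha) * scm_block

def mvdr_filter(scm_noise, steering):
    """MVDR beamformer w = (Phi_n^{-1} d) / (d^H Phi_n^{-1} d) for one bin."""
    num = np.linalg.solve(scm_noise, steering)
    return num / (steering.conj() @ num)
```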
“…The recording device is a tablet with multiple microphones, which is held by the speaker. Since some recordings involve microphone failures, the method from [48] is used to detect them. If failures are detected, the malfunctioning channels are excluded from further processing of the given recording.…”
Section: Speech Enhancement/Recognition on CHiME-4 Datasets (mentioning)
confidence: 99%
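As an illustration only, a simple correlation-based screen of the kind often used on CHiME-4 recordings is sketched below. It is a generic heuristic and not necessarily the method of [48]; the `threshold` value is an arbitrary assumption.

```python
import numpy as np

def flag_failed_channels(x, threshold=0.3):
    """Hypothetical channel-failure screen: flag channels whose maximum
    envelope correlation with every other channel falls below `threshold`.
    x: (channels, samples) time-domain multi-channel recording."""
    env = np.abs(x)                                  # coarse amplitude envelopes
    env = env - env.mean(axis=1, keepdims=True)      # zero-mean per channel
    norm = np.linalg.norm(env, axis=1, keepdims=True) + 1e-8
    corr = (env / norm) @ (env / norm).T             # channel-pair correlations
    np.fill_diagonal(corr, -1.0)                     # ignore self-correlation
    return corr.max(axis=1) < threshold              # True = likely failed channel
```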