Despite considerable progress in speech separation, speech enhancement, and automatic speech recognition, realistic meeting recognition remains largely unsolved. Most research on speech separation either focuses on spectral cues to address single-channel recordings or on spatial cues to separate multi-channel recordings, and relies exclusively on either neural networks or probabilistic graphical models. Integrating a spatial clustering approach with a deep learning approach based on spectral cues in a single framework can significantly improve automatic speech recognition performance and generalizability: the neural network profits from vast amounts of training data, while its probabilistic counterpart adapts to the current acoustic scene. This thesis therefore concentrates on the integration of two largely disjoint research streams, namely single-channel deep learning-based source separation and multi-channel probabilistic model-based source separation. It provides a general framework for integrating spatial and spectral cues in which neural networks and probabilistic graphical models complement each other to achieve state-of-the-art blind source separation performance on noisy, reverberant data. The efficacy of the proposed approaches is evaluated on simulated artificial mixtures as well as on real recordings of simultaneously active speakers. The key findings are that (1) a cascade integration, in which a neural network initializes a probabilistic graphical model, yields substantial improvements; (2) spatial cues can be used to train neural networks without supervision; and (3) a tight integration, in which both models are driven to a joint agreement, leads to the lowest word error rates and the best generalization to unseen real mixtures.
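
The cascade integration named in finding (1) can be illustrated with a minimal sketch: time-frequency masks, as a neural network might produce them, initialize the posteriors of a spatial mixture model, which is then refined with a few EM iterations. This is not the thesis's implementation; all names (`em_refine`, the zero-mean complex Gaussian mixture in place of a full cACGMM, the toy steering vectors) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def em_refine(Y, masks, iterations=5, eps=1e-8):
    """Refine speaker masks with a zero-mean complex Gaussian spatial
    mixture (one spatial covariance per source) for one frequency bin.

    Y:     (T, D) complex observation vectors (T frames, D microphones)
    masks: (T, K) initial class posteriors, e.g. from a neural network
    """
    T, D = Y.shape
    K = masks.shape[1]
    gamma = masks.copy()
    for _ in range(iterations):
        log_lik = np.empty((T, K))
        for k in range(K):
            # M-step: weighted spatial covariance of class k
            w = gamma[:, k][:, None]
            Sigma = (w * Y).T @ Y.conj() / (w.sum() + eps)
            Sigma += eps * np.eye(D)
            inv = np.linalg.inv(Sigma)
            _, logdet = np.linalg.slogdet(Sigma)
            # log N(y; 0, Sigma) up to an additive constant
            quad = np.einsum('td,de,te->t', Y.conj(), inv, Y).real
            log_lik[:, k] = -quad - logdet
        # E-step: posterior class affiliations = refined masks
        log_lik -= log_lik.max(axis=1, keepdims=True)
        gamma = np.exp(log_lik)
        gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

# Toy scene: two sources with distinct (hypothetical) steering vectors.
T, D = 400, 3
a1 = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)
a2 = np.array([1.0, -1.0, 1.0]) / np.sqrt(3)
s = rng.integers(0, 2, T)                      # dominant source per frame
amp = rng.normal(size=T) + 1j * rng.normal(size=T)
Y = np.where(s[:, None] == 0, a1, a2) * amp[:, None]
Y += 0.05 * (rng.normal(size=(T, D)) + 1j * rng.normal(size=(T, D)))

# Imperfect "network" masks: confident and correct on ~80% of frames,
# uninformative (0.5/0.5) on the rest.
init = np.full((T, 2), 0.5)
good = rng.random(T) < 0.8
init[good, s[good]] = 0.9
init[good, 1 - s[good]] = 0.1

masks = em_refine(Y, init)
```

The design point the sketch mirrors is the division of labor from the abstract: the learned masks carry spectral knowledge from training data, while the EM refinement adapts the spatial model to the scene at hand.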