Interspeech 2019
DOI: 10.21437/interspeech.2019-2459

Influence of Speaker-Specific Parameters on Speech Separation Systems

Abstract: Recent studies have shown that deep-learning-based single-channel speech separation systems perform worse for same-gender mixtures than for different-gender mixtures. In this work, we provide a more detailed analysis of the respective impact of the fundamental frequency and the vocal tract length on system performance. While both parameters are correlated with gender, the vocal tract length is a fixed speaker-specific parameter, whereas the fundamental frequency can vary across speaking styles. We …

Cited by 6 publications (6 citation statements)
References: 14 publications
“…Figure 5.12a shows average results on the WSJ-BSS database with a split based on the gender composition of a mixture. In accordance with findings on the WSJ0-2mix database analyzed, for example, in [169] the neural network-based approaches degrade quite a bit when separating speakers of the same gender, in particular two female speakers. However, please note that the WSJ-BSS database consists of considerably fewer female speakers.…”
Section: Analysis of Splits of the WSJ-BSS Database (supporting)
confidence: 88%
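The split described in this statement, grouping mixtures by gender composition and averaging a separation score per group, can be sketched roughly as follows. This is only an illustration: the function names, the `(gender_a, gender_b, score)` tuple layout, and the scores themselves are assumptions made up for the example and are not results from the cited work.

```python
from collections import defaultdict


def gender_composition(gender_a, gender_b):
    """Return an order-independent label such as 'FF', 'FM' or 'MM'."""
    return "".join(sorted((gender_a, gender_b)))


def average_by_composition(results):
    """Average a per-mixture score within each gender composition.

    `results` is an iterable of (gender_a, gender_b, score) tuples;
    this layout is a hypothetical convention chosen for the sketch.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for gender_a, gender_b, score in results:
        key = gender_composition(gender_a, gender_b)
        sums[key] += score
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Placeholder scores for illustration only, not measured values.
example_results = [
    ("F", "F", 8.0),
    ("M", "F", 12.0),
    ("F", "M", 10.0),
    ("M", "M", 9.0),
]
averages = average_by_composition(example_results)
```

Because the label is sorted, a female/male mixture and a male/female mixture fall into the same group. Per-group counts are worth reporting next to the averages, since the statement notes that the database contains considerably fewer female speakers.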
“…When using only spatial cues, a system is likely to confuse speakers which are very close to each other or even stand behind each other (compare Figure 5.10). When using only spectral cues, it is likely to confuse speakers with similar voices (compare Figure 5.12a or [169] for an analysis of how voice similarity influences DC performance). In comparison to the cascade approach in Section 4.2, the tight integration updates all parameters jointly while the cascade approach can potentially forget the spectral information after sufficiently many EM steps.…”
Section: Tight Integration of Spatial and Spectral Features (mentioning)
confidence: 99%
“…where the third dimension is given by the vector Z(l, k) defined in (8). The elements of Z(l, k) consist of the real and imaginary parts of the normalised amplitudes from the microphone signals.…”
Section: Input Features (mentioning)
confidence: 99%
“…However, when multiple speakers are active simultaneously, they cannot be separated based on generic speech structure alone. Then additional information is needed about the specific speaker characteristics, such as the gender of the target speaker [8] or some latent space embedding of the speaker characteristics [10,24]. With a compact microphone array, however, multiple overlapping speakers can be separated without the need for prior knowledge on the speaker characteristics, as long as they are not co-located in space.…”
Section: Introduction (mentioning)
confidence: 99%
“…In addition to the values presented in the tables, we carried out performance analyses based on the difference of the median fundamental frequencies of the speakers within a mixture. As we have shown in [16], the median fundamental frequency difference is an important influencing factor to the performance of monaural speech separation systems and it is of importance to improve the performance especially for mixtures of similar speakers where the fundamental frequency difference is below 50 Hz. For our best model we were able to improve the performance in this important region of fundamental frequency difference by 1.0 dB in contrast to Conv-TasNet with a learned filterbank.…”
Section: Results (mentioning)
confidence: 99%
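The quantity this statement refers to, the difference of the speakers' median fundamental frequencies within a mixture, can be computed along the following lines. This is a minimal sketch under stated assumptions: the per-frame F0 tracks are taken to come from some external pitch tracker, unvoiced frames are assumed to be encoded as 0 or None, and all function names and example values are hypothetical. Only the 50 Hz threshold is taken from the statement above.

```python
from statistics import median


def median_f0(f0_track_hz):
    """Median F0 in Hz over voiced frames.

    Assumption: unvoiced frames are encoded as 0 or None and are skipped.
    """
    voiced = [f for f in f0_track_hz if f]
    if not voiced:
        raise ValueError("no voiced frames in F0 track")
    return median(voiced)


def median_f0_difference(track_a, track_b):
    """Absolute difference of the two speakers' median F0 values in Hz."""
    return abs(median_f0(track_a) - median_f0(track_b))


def is_similar_pair(track_a, track_b, threshold_hz=50.0):
    """True if the mixture falls in the 'similar speakers' region (< 50 Hz)."""
    return median_f0_difference(track_a, track_b) < threshold_hz


# Made-up F0 tracks in Hz for illustration; 0 marks an unvoiced frame.
speaker_a = [0, 118.0, 121.0, 0, 125.0, 119.0]
speaker_b = [0, 0, 142.0, 150.0, 147.0, 0]

diff_hz = median_f0_difference(speaker_a, speaker_b)
similar = is_similar_pair(speaker_a, speaker_b)
```

Using the median rather than the mean keeps the estimate robust against octave errors and other outliers that pitch trackers occasionally produce.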