Interspeech 2020
DOI: 10.21437/interspeech.2020-1397

SpEx+: A Complete Time Domain Speaker Extraction Network

Abstract: Speaker extraction aims to extract the target speech signal from a multi-talker environment given a target speaker's reference speech. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation in frequency-domain approaches. Unfortunately, SpEx is not fully a time-domain solution since it performs time-domain speech encoding for speaker extraction, while taking frequency-domain speaker embedding as the reference. The size of the analysis window for time-domain and the size for frequenc…

Cited by 88 publications (89 citation statements)
References 24 publications
“…As illustrated in Fig. 2(a), speaker encoder consists a stack of three residual blocks followed by an adaptive average pooling layer (Avg-Pool) [10]. The speaker encoder takes a temporal sequence V (t) Ŝ r−1 (t) as input, where…”
Section: Speaker Encoder (mentioning)
confidence: 99%
“…speaker extraction [8]; SpEx/SpEx+ is another successful implementation that trains speaker embedding network jointly with speaker extraction network [7,10].…”
Section: Introduction (mentioning)
confidence: 99%
“…level features inside the separation model [6,19] or concatenate the speaker features with the mixture speech representations [8]. However, it is not trivial to find a single optimal layer at which to insert the speaker features.…”
Section: Proposed Speech Extraction Structure (mentioning)
confidence: 99%
“…An alternative solution to the label permutation problem is to perform target speaker extraction [6][7][8]. In this case, the separation model is biased with information about the identity of the target speaker to extract from the mixture.…”
Section: Introduction (mentioning)
confidence: 99%