SpEx+: A Complete Time Domain Speaker Extraction Network

Ge, Meng; Xu, Chenglin; Wang, Longbiao; Chng, Eng Siong; Dang, Jianwu; Li, Haizhou

doi:10.21437/interspeech.2020-1397

Cited by 88 publications

(89 citation statements)

References 24 publications

“…As illustrated in Fig. 2(a), speaker encoder consists a stack of three residual blocks followed by an adaptive average pooling layer (Avg-Pool) [10]. The speaker encoder takes a temporal sequence V (t) Ŝ r−1 (t) as input, where…”

Section: Speaker Encoder (mentioning)

confidence: 99%
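
The encoder structure quoted above (a stack of three residual blocks followed by average pooling over time) can be illustrated with a toy sketch. This is not the MuSE or SpEx+ implementation: `residual_block`, `average_pool_over_time`, `speaker_encoder`, and the scalar `weights` are hypothetical stand-ins for the learned convolutional layers, chosen only to show how a variable-length sequence becomes a fixed-size embedding.

```python
from typing import List, Sequence

Frames = List[List[float]]  # T frames, each a feature vector of size D


def residual_block(frames: Frames, weight: float) -> Frames:
    """Toy residual block: y = x + relu(weight * x).

    A real block would use learned convolutions and normalization;
    the scalar weight is an assumption made for illustration.
    """
    return [[x + max(0.0, weight * x) for x in frame] for frame in frames]


def average_pool_over_time(frames: Frames) -> List[float]:
    """Average each feature dimension across all T frames."""
    t = len(frames)
    return [sum(column) / t for column in zip(*frames)]


def speaker_encoder(frames: Frames, weights: Sequence[float] = (1.0, 1.0, 1.0)) -> List[float]:
    """Apply three residual blocks, then pool to one D-dimensional vector."""
    for weight in weights:
        frames = residual_block(frames, weight)
    return average_pool_over_time(frames)


short_input = [[1.0, -1.0], [3.0, -1.0]]               # T = 2
long_input = [[1.0, -1.0], [3.0, -1.0], [2.0, -1.0]]   # T = 3
short_embedding = speaker_encoder(short_input)
long_embedding = speaker_encoder(long_input)
```

Both inputs yield an embedding of length 2 even though they differ in duration, which is the property the pooling layer provides.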

Muse: Multi-Modal Target Speaker Extraction with Visual Cues

Pan, Tao, Xu, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Speaker extraction algorithm relies on the speech sample from the target speaker as the reference point to focus its attention. Such a reference speech is typically pre-recorded. On the other hand, the temporal synchronization between speech and lip movement also serves as an informative cue. Motivated by this idea, we study a novel technique to use speech-lip visual cues to extract reference target speech directly from mixture speech during inference time, without the need of pre-recorded reference speech. We propose a multi-modal speaker extraction network, named MuSE, that is conditioned only on a lip image sequence. MuSE not only outperforms other competitive baselines in terms of SI-SDR and PESQ, but also shows consistent improvement in cross-dataset evaluations.

“…speaker extraction [8]; SpEx/SpEx+ is another successful implementation that trains speaker embedding network jointly with speaker extraction network [7,10].…”

Section: Introduction (mentioning)

confidence: 99%

Muse: Multi-Modal Target Speaker Extraction with Visual Cues

Pan, Tao, Xu, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

“…3. Improved separator with U-Conv blocks … level features inside the separation model [6,19] or concatenate the speaker features with the mixture speech representations [8]. However, it is not trivial to find a single optimal layer at which to insert the speaker features.…”

Section: Proposed Speech Extraction Structure (mentioning)

confidence: 99%

“…An alternative solution to the label permutation problem is to perform target speaker extraction [6][7][8]. In this case, the separation model is biased with information about the identity of the target speaker to extract from the mixture.…”

Section: Introduction (mentioning)

confidence: 99%

“…The speaker embedding network can be either jointly trained with the speech extraction model to minimise the enhancement loss or trained on a different task, i.e., a speaker recognition task, to access larger speaker variations [9]. The target speaker embedding is usually inserted into the middle-stage features of the extraction network by using multiplication [7] or concatenation operations [8,10], however, the shared middle-features in the extraction model may not be optimal for both tasks of speaker conditioning and speech reconstruction.…”

Section: Introduction (mentioning)

confidence: 99%
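
The two conditioning operations named in the statement above, multiplication and concatenation, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code of any cited system: `fuse_multiply` and `fuse_concat` are hypothetical helper names, and features are plain lists of per-frame vectors rather than network tensors.

```python
from typing import List

Frames = List[List[float]]  # T frames, each a feature vector


def fuse_multiply(features: Frames, embedding: List[float]) -> Frames:
    """Scale every frame element-wise by the speaker embedding.

    Requires the embedding to match the feature dimension.
    """
    for frame in features:
        if len(frame) != len(embedding):
            raise ValueError("embedding and frame dimensions must match")
    return [[x * e for x, e in zip(frame, embedding)] for frame in features]


def fuse_concat(features: Frames, embedding: List[float]) -> Frames:
    """Append the same speaker embedding to every frame."""
    return [frame + embedding for frame in features]


mixture = [[1.0, 2.0], [3.0, 4.0]]  # two frames of mixture features
speaker = [0.5, 2.0]                # target speaker embedding

multiplied = fuse_multiply(mixture, speaker)
concatenated = fuse_concat(mixture, speaker)
```

Multiplication keeps the feature dimension unchanged but ties the embedding size to it, while concatenation allows any embedding size at the cost of widening every frame.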

Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism

Zhang, Zorilă, Doddipatla, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this paper, we present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture in noisy and reverberant environments. The proposed method is built on an improved multi-channel time-domain speech separation network which employs speaker embeddings to identify and extract multiple targets without label permutation ambiguity. To efficiently inform the speaker information to the extraction model, we propose a new speaker conditioning mechanism by designing an additional speaker branch for receiving external speaker embeddings. Experiments on 2-channel WHAMR! data show that the proposed system improves by 9% relative the source separation performance over a strong multi-channel baseline, and it increases the speech recognition accuracy by more than 16% relative over the same baseline.
