2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462661
Single Channel Target Speaker Extraction and Recognition with Speaker Beam

Cited by 163 publications (131 citation statements)
References 10 publications
“…Žmolíková et al. proposed a target-speaker neural beamformer that extracts a target speaker's utterances given a short sample of the target speaker's speech [21]. This model was recently extended with an ASR-based loss to maximize ASR accuracy, with promising results [22]. While target-speaker models require the additional input of a target speaker's speech sample, they can naturally solve the speaker permutation problem across utterances without requiring additional speaker identification after ASR.…”
Section: Introduction
confidence: 99%
“…Recently, to avoid such multistage processing, the use of an auxiliary speaker-aware feature has been investigated [20][21][22]. A clean speech sample spoken by the target speaker is also passed to the DNN.…”
Section: Auxiliary Speaker-aware Feature For Speech Separation
confidence: 99%
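The conditioning mechanism described above — passing a sample of the target speaker's clean speech to the network alongside the mixture — can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the random projections stand in for trained network weights, and the multiplicative scaling of a hidden layer by the speaker embedding is one of the adaptation schemes used in SpeakerBeam-style models.

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_embedding(enroll_spec, dim=32):
    """Average the enrollment frames, then project to an embedding.
    (A random projection stands in for a learned auxiliary network.)"""
    mean_frame = enroll_spec.mean(axis=0)                  # (n_freq,)
    proj = rng.standard_normal((mean_frame.size, dim))
    return 1.0 / (1.0 + np.exp(-(mean_frame @ proj)))      # sigmoid -> (dim,)

def extract_target(mixture_spec, enroll_spec, dim=32):
    """Estimate a time-frequency mask for the target speaker by scaling
    a hidden layer with the speaker embedding (multiplicative adaptation)."""
    n_frames, n_freq = mixture_spec.shape
    w_in = rng.standard_normal((n_freq, dim)) * 0.1
    w_out = rng.standard_normal((dim, n_freq)) * 0.1
    hidden = np.tanh(mixture_spec @ w_in)                  # (n_frames, dim)
    hidden = hidden * speaker_embedding(enroll_spec, dim)  # condition on speaker
    mask = 1.0 / (1.0 + np.exp(-(hidden @ w_out)))         # values in (0, 1)
    return mask * mixture_spec                             # masked magnitude spectrogram

# Toy magnitude spectrograms: 100 mixture frames, 50 enrollment frames, 257 bins.
mixture = np.abs(rng.standard_normal((100, 257)))
enroll = np.abs(rng.standard_normal((50, 257)))
target = extract_target(mixture, enroll)
print(target.shape)  # (100, 257)
```

Because the mask is a sigmoid, the extracted magnitudes are always bounded by the mixture magnitudes; in a real system the weights would be trained so the mask suppresses the interfering speaker.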
“…The success of model specialization suggests that speaker information is important for improving the performance of speech applications, including speech enhancement. In fact, for speech separation (or multi-talker separation [1]), several works have succeeded in extracting the desired speaker's speech by utilizing speaker information as an auxiliary input [20][21][22], in contrast to separating an arbitrary speakers' mixture with methods such as deep clustering [23] and permutation invariant training [24]. A limitation of these studies is that they require a guidance signal such as an adaptation utterance, because there is otherwise no way of knowing which signal in the speech mixture is the target.…”
Section: Introduction
confidence: 99%
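Permutation invariant training, mentioned in the excerpt above as the guidance-free alternative, resolves the output-to-speaker assignment by evaluating the loss under every permutation of the network outputs and training on the minimum. A minimal sketch of the utterance-level loss (not any particular paper's implementation):

```python
import itertools
import numpy as np

def pit_mse(estimates, references):
    """Utterance-level permutation invariant MSE: try every assignment of
    network outputs to reference sources and keep the cheapest one."""
    n_src = len(references)
    best, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n_src)):
        loss = np.mean([np.mean((estimates[p] - references[i]) ** 2)
                        for i, p in enumerate(perm)])
        if loss < best:
            best, best_perm = loss, perm
    return best, best_perm

# Two reference sources; the network emitted them in swapped order.
ref = [np.ones(8), np.zeros(8)]
est = [np.zeros(8) + 0.1, np.ones(8) - 0.1]
loss, perm = pit_mse(est, ref)
print(loss, perm)  # 0.01 (1, 0)
```

The brute-force search over permutations is factorial in the number of sources, which is acceptable for the two- or three-speaker mixtures these papers consider.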
“…For such overlapped speech, neither conventional ASR nor speaker diarization provides results with sufficient accuracy. It is known that mixing two speech signals significantly degrades ASR accuracy [4][5][6]. In addition, no speaker overlap is assumed in most conventional speaker diarization techniques, such as clustering of speech partitions (e.g.
Section: Introduction
confidence: 99%
“…In another line of research, target-speaker (TS) ASR, which automatically extracts and transcribes only the target speaker's utterances given a short sample of that speaker's speech, has been proposed [5, 18]. Žmolíková et al. proposed a target-speaker neural beamformer that extracts a target speaker's utterances given a short sample of that speaker's speech [18]. This model was recently extended to handle an ASR-based loss to maximize ASR accuracy, with promising results [5]. TS-ASR can naturally solve the speaker-permutation problem across utterances.…”
Section: Introduction
confidence: 99%