Learning Speaker Aware Offsets for Speaker Adaptation of Neural Networks

Sarı, Leda; Thomas, Samuel; Hasegawa‐Johnson, Mark

doi:10.21437/interspeech.2019-1788

Cited by 8 publications

(11 citation statements)

References 17 publications

(27 reference statements)

“…Our earlier work on adaptation by speaker aware offsets [12], [34] can also be grouped under the affine transformation category where speaker embeddings generated through an auxiliary network are used as bias vectors and subtracted from main network activations. As compared to [12], [34], in the current work, we investigate more general affine transformations. We also experiment with adding a nonlinearity to the transformation.…”

Section: A Speaker Adaptation (mentioning)

confidence: 99%

“…The proposed system for joint speaker adaptation and change detection combines the speaker adaptation scheme of [34] with Siamese change detection [30], and introduces an attention block to the auxiliary network. The attention block allows us to integrate ideas from the Siamese network into the new joint model.…”

Section: Joint Speaker Adaptation and Change Detection (mentioning)

confidence: 99%

“…In this work, the problem we address is whether we can construct a single acoustic model that can detect the speaker change and automatically adapt to different speakers simultaneously. In order to develop such a system, we combine ideas from neural network based speaker change detection [30] and unsupervised speaker adaptation by an auxiliary network [12], [34]. The key novel contribution of this work is a method that moves the speaker-change decision inside the speech recognition system, in the form of a soft-decision speaker-attention layer.…”

Section: Introduction (mentioning)

confidence: 99%

“…The method then trains the speaker-attention layer explicitly in order to minimize the ASR error rate. The speaker-attention layer is used to accumulate a soft-decision speaker embedding, and from that point onward, the network behaves similarly to [34]. In addition to the mean normalization proposed in [34], we also investigate an affine and a nonlinear transformation of these activations that depend on the speaker embedding generated by the auxiliary network.…”

Section: Introduction (mentioning)

confidence: 99%

“…The speaker-attention layer is used to accumulate a soft-decision speaker embedding, and from that point onward, the network behaves similarly to [34]. In addition to the mean normalization proposed in [34], we also investigate an affine and a nonlinear transformation of these activations that depend on the speaker embedding generated by the auxiliary network. We also show that the learned speaker embeddings can be used for speaker segmentation although we do not explicitly train the network with this objective.…”

Section: Introduction (mentioning)

confidence: 99%

Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection

Sarı

Hasegawa‐Johnson

Thomas

2021

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

Speaker adaptation and speaker change detection have both been studied extensively to improve automatic speech recognition (ASR). In many cases, these two problems are investigated separately: speaker change detection is implemented first to obtain single-speaker regions, and speaker adaptation is then performed using the derived speaker segments for improved ASR. However, in an online setting, we want to achieve both goals in a single pass. In this study, we propose a neural network architecture that learns a speaker embedding from which it can perform both speaker adaptation for ASR and speaker change detection. The proposed speaker embedding is computed using self-attention based on an auxiliary network attached to a main ASR network. ASR adaptation is then performed by subtracting, from the main network activations, a segment dependent affine transformation of the learned speaker embedding. In experiments on a broadcast news dataset and the Switchboard conversational dataset, we test our system on utterances with a change point in them and show that the proposed method achieves significantly better performance as compared to the unadapted main network (10-14% relative reduction in word error rate (WER)). The proposed architecture also outperforms three different speaker segmentation methods followed by ASR (around 10% relative reduction in WER).
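
The adaptation step described in this abstract (an attention-pooled speaker embedding, followed by subtracting an affine transformation of it from the main network activations) can be sketched roughly as below. This is a minimal NumPy illustration and not the authors' implementation: the single query vector, the function names `attention_pooled_embedding` and `adapt_activations`, and all shapes are assumptions made for the example, and the sketch computes one offset for the whole segment.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_pooled_embedding(aux_frames, query):
    """Pool per-frame auxiliary outputs (T, D) into one embedding (D,).

    Assumption: attention scores come from a dot product with a single
    learned query vector; the paper's self-attention may differ.
    """
    weights = softmax(aux_frames @ query)  # (T,), sums to 1
    return weights @ aux_frames            # weighted average, (D,)


def adapt_activations(hidden, embedding, W, b):
    """Subtract an affine transform of the embedding from activations.

    hidden: (T, H) main-network activations
    W: (H, D) and b: (H,) are hypothetical learned parameters.
    """
    offset = W @ embedding + b             # (H,)
    return hidden - offset                 # broadcast over all T frames


# Toy example with random values standing in for real network outputs.
rng = np.random.default_rng(0)
T, D, H = 5, 4, 3
aux_frames = rng.standard_normal((T, D))
query = rng.standard_normal(D)
hidden = rng.standard_normal((T, H))
W = rng.standard_normal((H, D))
b = rng.standard_normal(H)

embedding = attention_pooled_embedding(aux_frames, query)
adapted = adapt_activations(hidden, embedding, W, b)
print(adapted.shape)
```

With `W` fixed to the identity and `b` to zero (and `H == D`), the same function reduces to plain offset subtraction, which is the simpler form the citation statements above attribute to the earlier work.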

Joint Federated Learning and Personalization for on-Device ASR

Jia

Li

Malek

et al. 2023

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory for End-to-End ASR

Sarı

Moritz

Hori

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of an E2E neural network model. The E2E ASR system is based on the joint connectionist temporal classification and attention-based encoder-decoder architecture. M-vector and i-vector results are compared for inserting them at different layers of the encoder neural network using the WSJ and TED-LIUM2 ASR benchmarks. We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
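
The memory read described in this abstract can be illustrated with a short sketch: attention weights over stored i-vectors produce an M-vector, which is then concatenated to every frame of the input features. This is an assumed, simplified version in NumPy. The abstract does not say how the query is formed or how scores are computed, so the scaled dot-product scoring, the `query` argument, and the names `read_memory` and `append_m_vector` are hypothetical.

```python
import numpy as np


def read_memory(memory, query):
    """Read an M-vector from a memory of i-vectors.

    memory: (N, D) i-vectors taken from training speakers
    query:  (D,) vector derived from the current utterance (assumed)
    Returns the M-vector (D,) and the attention weights (N,).
    """
    scores = memory @ query / np.sqrt(memory.shape[1])  # scaled dot product
    scores = scores - scores.max()                      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ memory, weights


def append_m_vector(features, m_vector):
    """Concatenate the same M-vector to every frame of (T, F) features."""
    tiled = np.tile(m_vector, (features.shape[0], 1))   # (T, D)
    return np.concatenate([features, tiled], axis=1)    # (T, F + D)


# Toy example: 8 stored i-vectors of dimension 4, 10 frames of 6-dim features.
rng = np.random.default_rng(0)
memory = rng.standard_normal((8, 4))
query = rng.standard_normal(4)
features = rng.standard_normal((10, 6))

m_vector, weights = read_memory(memory, query)
augmented = append_m_vector(features, m_vector)
print(augmented.shape)
```

Because the M-vector is a convex combination of stored i-vectors, nothing outside the model has to extract a speaker embedding at test time, which matches the property the abstract highlights.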
