ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054249

Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory for End-to-End ASR

Abstract: We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of an E2E neural network model. The E2E ASR system is based on the joi…
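
As a rough illustration of the mechanism the abstract describes (not the authors' actual code), here is a minimal PyTorch-style sketch of an attention read over a fixed memory of speaker i-vectors, with the weighted sum (the M-vector) concatenated to the hidden activations. The class name, projection layer, and all dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeakerMemory(nn.Module):
        """Hypothetical sketch: attention-based read over stored i-vectors."""

        def __init__(self, ivectors: torch.Tensor, query_dim: int):
            super().__init__()
            # ivectors: (num_speakers, ivec_dim), extracted from training data
            self.register_buffer("memory", ivectors)
            self.query_proj = nn.Linear(query_dim, ivectors.size(1))

        def forward(self, hidden: torch.Tensor) -> torch.Tensor:
            # hidden: (batch, time, query_dim) acoustic features or activations
            query = self.query_proj(hidden)                # (B, T, ivec_dim)
            scores = torch.matmul(query, self.memory.t())  # (B, T, num_speakers)
            weights = F.softmax(scores, dim=-1)            # attention over speakers
            m_vector = torch.matmul(weights, self.memory)  # (B, T, ivec_dim)
            # Concatenate the M-vector to the activations, as the abstract states
            return torch.cat([hidden, m_vector], dim=-1)

    # Example: 500 training speakers, 100-dim i-vectors, 256-dim activations
    mem = SpeakerMemory(torch.randn(500, 100), query_dim=256)
    out = mem(torch.randn(8, 120, 256))  # -> shape (8, 120, 356)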

Cited by 21 publications (23 citation statements) | References 26 publications
“…On the other hand, owing to the small amounts of data available for adaptation, the gains are usually lower than one could obtain with speaker-level clusters. While many approaches use utterances to directly extract corresponding embeddings to use as an auxiliary input for the acoustic model [56]-[59], one can also build a fixed inventory of speaker, domain, or topic codes [60] or embeddings [61], [62] when learning the acoustic model or acoustic encoder, and then use the test utterance to select a combination of these at the test stage. The latter approach alleviates the necessity of estimating an accurate representation from small amounts of data.…”
Section: Identifying Adaptation Targets
confidence: 99%
“…It may be possible to relax the utterance-level constraint by iteratively re-estimating adaptation statistics using one or more preceding segments [57]. Extra care usually needs to be taken to handle silence and speech uttered by different speakers, as failing to do so may deteriorate the overall ASR performance [62]-[64].…”
Section: Identifying Adaptation Targets
confidence: 99%
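
To make the idea of iterative re-estimation concrete, the toy sketch below keeps a running mean of segment-level speaker embeddings while skipping segments flagged as silence or as a different speaker, the failure mode this statement warns about. All function and variable names are hypothetical; this is not the cited papers' actual procedure.

    import numpy as np

    def update_speaker_estimate(emb_sum: np.ndarray, n_used: int,
                                seg_emb: np.ndarray,
                                is_speech: bool, same_speaker: bool):
        """Toy running-mean re-estimation over preceding segments.

        Silence and other-speaker segments are excluded, since folding
        them in can degrade the adapted model's ASR performance.
        """
        if is_speech and same_speaker:
            emb_sum = emb_sum + seg_emb
            n_used += 1
        estimate = emb_sum / max(n_used, 1)  # current speaker representation
        return emb_sum, n_used, estimate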