Interspeech 2018
DOI: 10.21437/interspeech.2018-1158

Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification

Abstract: This paper introduces a new method to extract speaker embeddings from a deep neural network (DNN) for text-independent speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over the frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. We relax this assumption and compute the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, and their weights are auto…
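
The weighted-average idea the abstract describes maps onto a small pooling layer. Below is a minimal PyTorch sketch of self-attentive pooling; the layer sizes and names (`hidden_dim`, `attn_dim`) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Compute a speaker embedding as an attention-weighted average of
    frame-level hidden vectors (sketch; sizes are illustrative)."""
    def __init__(self, hidden_dim=512, attn_dim=64):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)       # W, b
        self.score = nn.Linear(attn_dim, 1, bias=False)   # v

    def forward(self, h):
        # h: (batch, frames, hidden_dim) frame-level hidden vectors
        e = self.score(torch.tanh(self.proj(h)))          # (batch, frames, 1)
        alpha = F.softmax(e, dim=1)                       # weights sum to 1 over frames
        return (alpha * h).sum(dim=1)                     # (batch, hidden_dim)
```

Setting every weight to 1/frames recovers the equal-importance average that the paper relaxes.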


Cited by 228 publications (185 citation statements)
References 14 publications (29 reference statements)
“…Attention mechanisms have led to significant advances across computer vision, spoken language understanding and natural language processing, increasing the modelling capacity of deep neural networks by concentrating on crucial features and suppressing unimportant ones. For speaker recognition, [9,10] utilize self-attention for aggregating frame-level features. [11] combined an attention mechanism with statistics pooling [5] to propose attentive statistics pooling.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
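
For context on the attentive statistics pooling of [11] mentioned above, here is a hedged PyTorch sketch: attention weights produce a weighted mean and a weighted standard deviation, which are concatenated. The parameterization and sizes are assumptions, not the exact model of [11].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveStatisticsPooling(nn.Module):
    """Attention-weighted mean and standard deviation over frames,
    in the spirit of attentive statistics pooling (sketch)."""
    def __init__(self, feat_dim=512, attn_dim=64):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(),
            nn.Linear(attn_dim, 1, bias=False),
        )

    def forward(self, h):                              # h: (batch, frames, feat_dim)
        alpha = F.softmax(self.attention(h), dim=1)    # (batch, frames, 1)
        mean = (alpha * h).sum(dim=1)                  # weighted first moment
        var = (alpha * h.pow(2)).sum(dim=1) - mean.pow(2)
        std = var.clamp(min=1e-8).sqrt()               # floor avoids sqrt of tiny negatives
        return torch.cat([mean, std], dim=1)           # (batch, 2 * feat_dim)
```
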
“…[7] proposed the usage of dictionary-based NetVLAD or GhostVLAD [8] for aggregating temporal features, using a 34-layer ResNet-based front-end for feature extraction. Numerous recent works [9,10,11,12] have proposed attention-based techniques for aggregating frame-level feature descriptors, to assign greater importance to the more discriminative frames.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
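
The dictionary-based aggregation of [7,8] can be sketched as follows; this is a simplified NetVLAD layer with an assumed cluster count and normalization choices, not the cited implementation. GhostVLAD differs in adding "ghost" clusters whose aggregated residuals are dropped from the output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Soft-assign frame descriptors to learned clusters and aggregate
    the residuals to the cluster centroids (simplified sketch)."""
    def __init__(self, feat_dim=512, num_clusters=8):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim))
        self.assign = nn.Linear(feat_dim, num_clusters)

    def forward(self, x):                              # x: (batch, frames, feat_dim)
        a = F.softmax(self.assign(x), dim=-1)          # soft cluster assignments
        r = x.unsqueeze(2) - self.centroids[None, None]  # (batch, frames, K, D) residuals
        vlad = (a.unsqueeze(-1) * r).sum(dim=1)        # aggregate residuals per cluster
        vlad = F.normalize(vlad, dim=-1)               # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=-1)    # (batch, K * feat_dim)
```
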
“…It is now widely used for speaker recognition and is effective in speaker embedding extraction. The second baseline ("X-Vectors+Attention") is built by combining a global attention mechanism with a TDNN [13,14]. For evaluation on our speaker identification task, the correct prediction rate (prediction accuracy) is reported in this work.…”
Section: Experiments Setup (citation type: mentioning)
confidence: 99%
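
The TDNN front-end this excerpt refers to can be sketched with dilated 1-D convolutions. The context widths and dimensions below follow a common x-vector-style recipe but are assumptions, not the citing paper's exact baseline; an attention-pooling layer like the ones sketched above would sit on top to form the "X-Vectors+Attention" baseline.

```python
import torch
import torch.nn as nn

class TDNNFrontEnd(nn.Module):
    """Frame-level TDNN: each dilated Conv1d splices frames over a
    growing temporal context (sketch with illustrative widths)."""
    def __init__(self, in_dim=30, hid=512):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(in_dim, hid, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hid, hid, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hid, hid, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(hid, hid, kernel_size=1), nn.ReLU(),
        )

    def forward(self, feats):                      # feats: (batch, frames, in_dim)
        h = self.layers(feats.transpose(1, 2))     # Conv1d expects (batch, dim, frames)
        return h.transpose(1, 2)                   # (batch, frames', hid)
```
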
“…David et al. [12] used a five-layer DNN that takes a small temporal context into account, together with statistics pooling. To further improve performance in embedding generation, attention mechanisms have also been used in some recent studies [13,14]. Wang et al. [13] used an attentive X-vector, where a self-attention layer was added before the statistics pooling layer to weight each frame.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
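
For contrast with the attentive variants sketched above, the plain statistics pooling used in the x-vector line of work [12] weights every frame equally. A minimal sketch:

```python
import torch

def statistics_pooling(h, eps=1e-8):
    """Unweighted statistics pooling: concatenate per-utterance mean and
    standard deviation of frame-level features h: (batch, frames, dim)."""
    mean = h.mean(dim=1)
    std = h.var(dim=1, unbiased=False).clamp(min=eps).sqrt()
    return torch.cat([mean, std], dim=1)   # (batch, 2 * dim)
```
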
“…Self-attention and i-vector-based attention represent two kinds of algorithms in the SV field [9,10]. In single-head self-attention [9], Eq. (1) can be written as:…”
Section: Attentive Statistics Pooling (citation type: mentioning)
confidence: 99%
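
The excerpt truncates before the cited Eq. (1). For orientation, the standard single-head self-attentive pooling used in this line of work takes the form below; the notation is mine and may differ from the citing paper's.

```latex
% Single-head self-attentive pooling over frame-level hidden vectors h_t
e_t = \mathbf{v}^\top \tanh(\mathbf{W}\mathbf{h}_t + \mathbf{b}), \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)}, \qquad
\tilde{\mathbf{e}} = \sum_{t=1}^{T} \alpha_t \mathbf{h}_t
```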