Interspeech 2018
DOI: 10.21437/interspeech.2018-993

Attentive Statistics Pooling for Deep Speaker Embedding

Abstract: This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an utterance-level feature. Our method utilizes an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations. In this way, it can capture long-term variations in speaker characteristics more eff…
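The abstract describes the pooling operation at a high level. As a minimal sketch of the idea (not the authors' reference implementation), the following PyTorch-style module computes one attention weight per frame and uses it to form the weighted mean and weighted standard deviation; the module name, hidden size, and the (batch, channels, frames) tensor layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and standard deviation over frames.

    Input:  x of shape (batch, channels, frames) -- frame-level features.
    Output: (batch, 2 * channels) utterance-level vector.
    """
    def __init__(self, channels, attn_hidden=128):
        super().__init__()
        # Small network that scores each frame with a single scalar.
        self.attention = nn.Sequential(
            nn.Conv1d(channels, attn_hidden, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(attn_hidden, 1, kernel_size=1),
        )

    def forward(self, x):
        # alpha: (batch, 1, frames), normalized over the frame axis.
        alpha = torch.softmax(self.attention(x), dim=2)
        mean = torch.sum(alpha * x, dim=2)                   # weighted mean
        var = torch.sum(alpha * x * x, dim=2) - mean * mean  # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))                # weighted std
        return torch.cat([mean, std], dim=1)
```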

Help me understand this report
View preprint versions

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
2
1

Citation Types

1
266
0

Year Published

2020
2020
2023
2023

Publication Types

Select...
5
3

Relationship

0
8

Authors

Journals

Cited by 391 publications (267 citation statements)
References 21 publications (49 reference statements)

Citation statements (ordered by relevance):

“…For speaker recognition, [9,10] utilize self-attention for aggregating frame-level features. [11] combined attention mechanism with statistics pooling [5] to propose attentive-statistics pooling. Most recently, [12] employ the idea of multi-head attention [14] for feature aggregation, outperforming an I-vector+PLDA baseline by 58% (relative).…”
Section: Related Work
confidence: 99%
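The multi-head aggregation mentioned above can take several forms; one common formulation, sketched here as an assumption rather than the cited papers' exact method, gives each head its own frame weights and concatenates the per-head weighted means.

```python
import torch
import torch.nn as nn

class MultiHeadAttentivePooling(nn.Module):
    """Each head learns its own frame weights and contributes a weighted mean."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        # One scalar score per frame and per head.
        self.score = nn.Conv1d(channels, num_heads, kernel_size=1)

    def forward(self, x):                        # x: (batch, channels, frames)
        w = torch.softmax(self.score(x), dim=2)  # (batch, heads, frames)
        # Weighted mean per head: (batch, heads, channels)
        pooled = torch.einsum('bht,bct->bhc', w, x)
        return pooled.flatten(start_dim=1)       # (batch, heads * channels)
```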
“…[7] proposed the usage of dictionary-based NetVLAD or GhostVLAD [8] for aggregating temporal features, using a 34-layer ResNet-based front-end for feature extraction. Numerous recent works [9,10,11,12] have proposed attention-based techniques for aggregation of frame-level feature descriptors, to assign greater importance to the more discriminative frames.…”
Section: Introduction
confidence: 99%
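For contrast with attention-based pooling, a rough sketch of the dictionary-based aggregation mentioned above (NetVLAD with ghost clusters, i.e. GhostVLAD) could look like the following; the cluster counts, centroid initialization, and normalization details are assumptions, not the cited papers' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostVLAD(nn.Module):
    """NetVLAD-style aggregation with ghost clusters that absorb
    uninformative frames and are dropped from the output."""
    def __init__(self, channels, num_clusters=8, ghost_clusters=2):
        super().__init__()
        total = num_clusters + ghost_clusters
        self.num_clusters = num_clusters
        self.assign = nn.Conv1d(channels, total, kernel_size=1)  # soft assignment
        self.centroids = nn.Parameter(torch.randn(total, channels))

    def forward(self, x):                          # x: (batch, channels, frames)
        a = torch.softmax(self.assign(x), dim=1)   # assignment over clusters
        # Assignment-weighted residuals of each frame to each centroid:
        # vlad[b, k, c] = sum_t a[b, k, t] * (x[b, c, t] - centroid[k, c])
        vlad = torch.einsum('bkt,bct->bkc', a, x) \
             - a.sum(dim=2).unsqueeze(-1) * self.centroids
        vlad = vlad[:, :self.num_clusters]         # drop the ghost clusters
        vlad = F.normalize(vlad, dim=2)            # intra-cluster normalization
        return F.normalize(vlad.flatten(1), dim=1)
```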
“…Instead of using the stats pooling that the original architecture used, attentive stats pooling [17] was used, with 128 hidden units in the single attention head for the VoxCeleb system, and 64 for the CALLHOME system. After pooling, the VoxCeleb system was projected to an embedding of size 512, and CALLHOME to a 128-dimension embedding.…”
Section: Baselines
confidence: 99%
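Assuming the AttentiveStatsPooling module sketched after the abstract and an illustrative encoder width, the VoxCeleb configuration described in this statement (a single attention head with 128 hidden units, followed by a 512-dimensional embedding) could be wired roughly as follows; the channel count and batch shapes are made up for the example.

```python
import torch
import torch.nn as nn
# Relies on the AttentiveStatsPooling module sketched after the abstract above.

frame_channels = 1500                          # illustrative encoder output width
pooling = AttentiveStatsPooling(frame_channels, attn_hidden=128)  # 128 hidden units
embed = nn.Linear(2 * frame_channels, 512)     # mean + std concatenated -> 512-dim embedding

features = torch.randn(8, frame_channels, 300)  # (batch, channels, frames)
embedding = embed(pooling(features))            # shape: (8, 512)
```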
“…The improvement of our baseline over the Kaldi baseline for cosine similarity is likely due to the use of attentive statistics pooling and the angular penalty softmax. The most comparable network architecture in the literature is that of Okabe et al [17], which achieves an EER of 3.8% on VoxCeleb. In the recent VoxSRC 4 competition, much lower values for EER on VoxCeleb 1 were achieved (< 2%), generally using much deeper models and also with higher dimension inputs.…”
Section: DER Baseline (Kaldi)
confidence: 99%
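Cosine-similarity scoring of embeddings, as used for the baseline discussed above, reduces to the following; the decision threshold shown is purely illustrative and would in practice be tuned on a development set (the equal-error-rate operating point is one common choice).

```python
import torch
import torch.nn.functional as F

def cosine_score(enroll, test):
    """Cosine-similarity verification score between two speaker embeddings."""
    return F.cosine_similarity(enroll, test, dim=-1)

# Illustrative trial: accept the same-speaker hypothesis above a threshold.
score = cosine_score(torch.randn(512), torch.randn(512))
same_speaker = score > 0.3   # the 0.3 threshold is purely illustrative
```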
“…In order to obtain long-term speaker representation with more discriminative power, attention mechanism [17] is widely used recently. In [27], attentive statistics pooling was proposed to replace the conventional statistics pooling. In [10], multi-head self-attention mechanism was applied.…”
Section: Introduction
confidence: 99%
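For reference, the conventional statistics pooling that attentive statistics pooling replaces uses uniform frame weights; a minimal sketch, with the same tensor layout assumed in the earlier examples:

```python
import torch

def statistics_pooling(x):
    """Conventional (unweighted) statistics pooling over frames.

    x: (batch, channels, frames) -> (batch, 2 * channels).
    Attentive statistics pooling replaces the uniform frame weights implied
    here with learned, frame-dependent attention weights.
    """
    return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)
```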