Interspeech 2019
DOI: 10.21437/interspeech.2019-2616

Self Multi-Head Attention for Speaker Recognition

Abstract: Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work on a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments, and those are averaged to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-ter…
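The truncated abstract describes a CNN front-end whose frame-level outputs are pooled with self multi-head attention into a single utterance-level embedding. The following PyTorch sketch illustrates that pooling idea under our own assumptions; the module name, head count, and per-head learned query are illustrative and not taken from the paper.

```python
# Minimal sketch (not the authors' exact implementation) of self multi-head
# attention pooling: frame-level CNN features are split into heads, each head
# scores every frame with its own learned query, and the per-head weighted
# sums are concatenated into one fixed-size utterance-level embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttentivePooling(nn.Module):
    def __init__(self, feat_dim: int = 512, num_heads: int = 8):
        super().__init__()
        assert feat_dim % num_heads == 0, "feat_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = feat_dim // num_heads
        # One learned query vector per head (an assumption; the paper's exact
        # parameterisation may differ).
        self.query = nn.Parameter(torch.randn(num_heads, self.head_dim) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) frame-level features from the CNN encoder.
        b, t, d = x.shape
        heads = x.view(b, t, self.num_heads, self.head_dim)
        logits = torch.einsum("bthd,hd->bth", heads, self.query)  # score per frame and head
        weights = F.softmax(logits, dim=1)                         # normalise over frames
        pooled = (weights.unsqueeze(-1) * heads).sum(dim=1)        # (b, num_heads, head_dim)
        return pooled.reshape(b, d)                                # utterance-level embedding

# Works for any number of frames, so utterances need not have a fixed length.
pooling = MultiHeadAttentivePooling(feat_dim=512, num_heads=8)
embedding = pooling(torch.randn(4, 300, 512))  # -> shape (4, 512)
```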

Cited by 72 publications (50 citation statements).
References 28 publications (43 reference statements).
“…[11] combined an attention mechanism with statistics pooling [5] to propose attentive-statistics pooling. Most recently, [12] employ the idea of multi-head attention [14] for feature aggregation, outperforming an I-vector+PLDA baseline by 58% (relative). However, by applying attention or similar techniques only on the feature descriptors generated by the DNN front-end and not throughout the front-end model, the majority of recent works are (i) not fully utilising the representation power of DNN front-end models; and (ii) implicitly modelling temporal attention alone in the process.…”
Section: Related Work (mentioning)
confidence: 99%
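For context, the attentive-statistics pooling credited to [11] in the excerpt above combines attention weights with the mean-and-standard-deviation statistics of statistics pooling [5]. A hedged PyTorch sketch follows; the attention-network layer sizes are illustrative assumptions rather than values from the cited work.

```python
# Sketch of attentive statistics pooling in the spirit of [11]: a small
# attention network weights each frame, and the weighted mean and weighted
# standard deviation are concatenated into the utterance-level representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveStatisticsPooling(nn.Module):
    def __init__(self, feat_dim: int = 512, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim)
        w = F.softmax(self.attention(x), dim=1)          # per-frame weights, sum to 1
        mean = (w * x).sum(dim=1)                        # weighted mean
        var = (w * x.pow(2)).sum(dim=1) - mean.pow(2)    # weighted variance
        std = var.clamp(min=1e-8).sqrt()                 # weighted standard deviation
        return torch.cat([mean, std], dim=1)             # (batch, 2 * feat_dim)
```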
“…[7] proposed the usage of dictionary-based NetVLAD or GhostVLAD [8] for aggregating temporal features, using a 34-layer ResNet-based front-end for feature extraction. Numerous recent works [9, 10, 11, 12] have proposed attention-based techniques for aggregation of frame-level feature descriptors, to assign greater importance to the more discriminative frames.…”
Section: Introduction (mentioning)
confidence: 99%
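The NetVLAD/GhostVLAD aggregation mentioned for [7, 8] replaces simple averaging with a learned dictionary of cluster centres. A rough sketch under our own assumptions (cluster count, initialisation, and the linear soft-assignment layer are illustrative; GhostVLAD would additionally introduce "ghost" clusters that are dropped after assignment):

```python
# Rough sketch of dictionary-based NetVLAD aggregation: frame-level features
# are softly assigned to learned cluster centres and the residuals to each
# centre are accumulated into a fixed-size utterance descriptor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLADPooling(nn.Module):
    def __init__(self, feat_dim: int = 512, num_clusters: int = 8):
        super().__init__()
        self.centres = nn.Parameter(torch.randn(num_clusters, feat_dim) * 0.1)
        self.assign = nn.Linear(feat_dim, num_clusters)   # soft-assignment scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim)
        a = F.softmax(self.assign(x), dim=-1)              # (batch, frames, clusters)
        # Residuals of every frame to every centre: (batch, frames, clusters, feat_dim)
        residuals = x.unsqueeze(2) - self.centres.unsqueeze(0).unsqueeze(0)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)    # aggregate over frames
        vlad = F.normalize(vlad, dim=-1)                   # intra-(per-cluster) normalisation
        return F.normalize(vlad.flatten(1), dim=-1)        # (batch, clusters * feat_dim)
```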
“…For example, [23] encodes short-term talker characteristics from the spectrogram, and a multi-head attention model is adopted to map these representations into a long-term speaker embedding. By employing multi-head attention, [24] models the inner dependencies between units at different positions in the learned feature sequence, which enriches the information that is captured. Reference [25] employs multi-head attention to highlight the speaker-related features learned from context information in the frequency and time domains.…”
Section: Multi-Head Self-Attention (mentioning)
confidence: 99%
“…Most encoding layers are based on various pooling methods, for example, temporal average pooling (TAP) [10, 14, 16], global average pooling (GAP) [13, 15], and statistical pooling (SP) [6, 14, 17, 18]. In particular, self-attentive pooling (SAP) has improved performance by focusing on the frames for a more discriminative utterance-level feature [10, 19, 20], and pooling layers provide compressed speaker information by rescaling the input size. These are mainly used with convolutional neural networks (CNN) [10, 13-17, 20].…”
mentioning
confidence: 99%
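The non-attentive pooling layers listed in this excerpt reduce the frame axis with simple statistics. A minimal sketch of temporal average pooling (TAP) and statistical pooling (SP), for contrast with the attentive variants sketched earlier (function names are our own):

```python
# TAP keeps only the mean over frames; SP concatenates mean and standard
# deviation, so the utterance vector also carries spread information.
import torch

def temporal_average_pooling(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, frames, feat_dim) -> (batch, feat_dim)
    return x.mean(dim=1)

def statistical_pooling(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, frames, feat_dim) -> (batch, 2 * feat_dim)
    return torch.cat([x.mean(dim=1), x.std(dim=1)], dim=1)
```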
“…In particular, self-attentive pooling (SAP) has improved performance by focusing on the frames for a more discriminative utterance-level feature [10, 19, 20], and pooling layers provide compressed speaker information by rescaling the input size. These are mainly used with convolutional neural networks (CNN) [10, 13-17, 20]. The speaker embedding is extracted using the output value of the last pooling layer in a CNN-based speaker model.…”
mentioning
confidence: 99%