Interspeech 2019
DOI: 10.21437/interspeech.2019-2616

Self Multi-Head Attention for Speaker Recognition

Abstract: Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work on a short utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments, and those are averaged to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-ter…
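The truncated abstract describes a CNN front-end whose frame-level outputs are pooled with self multi-head attention into a single utterance-level embedding. The following PyTorch sketch illustrates that pooling idea under our own assumptions; the module name, head count, and per-head learned query are illustrative and not taken from the paper.

```python
# Minimal sketch (not the authors' exact implementation) of self multi-head
# attention pooling: frame-level CNN features are split into heads, each head
# scores every frame with its own learned query, and the per-head weighted
# sums are concatenated into one fixed-size utterance-level embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttentivePooling(nn.Module):
    def __init__(self, feat_dim: int = 512, num_heads: int = 8):
        super().__init__()
        assert feat_dim % num_heads == 0, "feat_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = feat_dim // num_heads
        # One learned query vector per head (an assumption; the paper's exact
        # parameterisation may differ).
        self.query = nn.Parameter(torch.randn(num_heads, self.head_dim) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) frame-level features from the CNN encoder.
        b, t, d = x.shape
        heads = x.view(b, t, self.num_heads, self.head_dim)
        logits = torch.einsum("bthd,hd->bth", heads, self.query)  # score per frame and head
        weights = F.softmax(logits, dim=1)                         # normalise over frames
        pooled = (weights.unsqueeze(-1) * heads).sum(dim=1)        # (b, num_heads, head_dim)
        return pooled.reshape(b, d)                                # utterance-level embedding

# Works for any number of frames, so utterances need not have a fixed length.
pooling = MultiHeadAttentivePooling(feat_dim=512, num_heads=8)
embedding = pooling(torch.randn(4, 300, 512))  # -> shape (4, 512)
```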

Cited by 72 publications (50 citation statements).
References 28 publications (43 reference statements).
“…[11] combined an attention mechanism with statistics pooling [5] to propose attentive-statistics pooling. Most recently, [12] employ the idea of multi-head attention [14] for feature aggregation, outperforming an I-vector+PLDA baseline by 58% (relative). However, by applying attention or similar techniques only on the feature descriptors generated by the DNN front-end and not throughout the front-end model, the majority of recent works are (i) not fully utilising the representation power of DNN front-end models; and (ii) implicitly modelling temporal attention alone in the process.…”
Section: Related Work (mentioning)
confidence: 99%
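For context, the attentive-statistics pooling credited to [11] in the excerpt above combines attention weights with the mean-and-standard-deviation statistics of statistics pooling [5]. A hedged PyTorch sketch follows; the attention-network layer sizes are illustrative assumptions rather than values from the cited work.

```python
# Sketch of attentive statistics pooling in the spirit of [11]: a small
# attention network weights each frame, and the weighted mean and weighted
# standard deviation are concatenated into the utterance-level representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveStatisticsPooling(nn.Module):
    def __init__(self, feat_dim: int = 512, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim)
        w = F.softmax(self.attention(x), dim=1)          # per-frame weights, sum to 1
        mean = (w * x).sum(dim=1)                        # weighted mean
        var = (w * x.pow(2)).sum(dim=1) - mean.pow(2)    # weighted variance
        std = var.clamp(min=1e-8).sqrt()                 # weighted standard deviation
        return torch.cat([mean, std], dim=1)             # (batch, 2 * feat_dim)
```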
“…[7] proposed the usage of dictionary-based NetVLAD or GhostVLAD [8] for aggregating temporal features, using a 34-layer ResNet-based front-end for feature extraction. Numerous recent works [9, 10, 11, 12] have proposed attention-based techniques for aggregation of frame-level feature descriptors, to assign greater importance to the more discriminative frames.…”
Section: Introduction (mentioning)
confidence: 99%
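The NetVLAD/GhostVLAD aggregation mentioned for [7, 8] replaces simple averaging with a learned dictionary of cluster centres. A rough sketch under our own assumptions (cluster count, initialisation, and the linear soft-assignment layer are illustrative; GhostVLAD would additionally introduce "ghost" clusters that are dropped after assignment):

```python
# Rough sketch of dictionary-based NetVLAD aggregation: frame-level features
# are softly assigned to learned cluster centres and the residuals to each
# centre are accumulated into a fixed-size utterance descriptor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLADPooling(nn.Module):
    def __init__(self, feat_dim: int = 512, num_clusters: int = 8):
        super().__init__()
        self.centres = nn.Parameter(torch.randn(num_clusters, feat_dim) * 0.1)
        self.assign = nn.Linear(feat_dim, num_clusters)   # soft-assignment scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim)
        a = F.softmax(self.assign(x), dim=-1)              # (batch, frames, clusters)
        # Residuals of every frame to every centre: (batch, frames, clusters, feat_dim)
        residuals = x.unsqueeze(2) - self.centres.unsqueeze(0).unsqueeze(0)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)    # aggregate over frames
        vlad = F.normalize(vlad, dim=-1)                   # intra-(per-cluster) normalisation
        return F.normalize(vlad.flatten(1), dim=-1)        # (batch, clusters * feat_dim)
```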
“…For example, [23] encodes short-term talker characteristics from the spectrogram, and a multi-head attention model is adopted to map these representations into a long-term speaker embedding. By employing multi-head attention, [24] models the inner dependencies between units at different positions in the learned feature sequence, which enriches the information that is captured. Reference [25] employs multi-head attention to highlight the speaker-related features learned from context information in the frequency and time domains.…”
Section: Multi-Head Self-Attention (mentioning)
confidence: 99%
“…Most encoding layers are based on various pooling methods, for example, temporal average pooling (TAP) [10, 14, 16], global average pooling (GAP) [13, 15], and statistical pooling (SP) [6, 14, 17, 18]. In particular, self-attentive pooling (SAP) has improved performance by focusing on the frames for a more discriminative utterance-level feature [10, 19, 20], and pooling layers provide compressed speaker information by rescaling the input size. These are mainly used with convolutional neural networks (CNN) [10, 13-17, 20].…”
mentioning
confidence: 99%
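The non-attentive pooling layers listed in this excerpt reduce the frame axis with simple statistics. A minimal sketch of temporal average pooling (TAP) and statistical pooling (SP), for contrast with the attentive variants sketched earlier (function names are our own):

```python
# TAP keeps only the mean over frames; SP concatenates mean and standard
# deviation, so the utterance vector also carries spread information.
import torch

def temporal_average_pooling(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, frames, feat_dim) -> (batch, feat_dim)
    return x.mean(dim=1)

def statistical_pooling(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, frames, feat_dim) -> (batch, 2 * feat_dim)
    return torch.cat([x.mean(dim=1), x.std(dim=1)], dim=1)
```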
“…In particular, self-attentive pooling (SAP) has improved performance by focusing on the frames for a more discriminative utterance-level feature [10, 19, 20], and pooling layers provide compressed speaker information by rescaling the input size. These are mainly used with convolutional neural networks (CNN) [10, 13-17, 20]. The speaker embedding is extracted using the output value of the last pooling layer in a CNN-based speaker model.…”
mentioning
confidence: 99%