Interspeech 2018
DOI: 10.21437/interspeech.2018-1158

Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification

Abstract: This paper introduces a new method to extract speaker embeddings from a deep neural network (DNN) for text-independent speaker verification. Usually, speaker embeddings are extracted from a speaker-classification DNN that averages the hidden vectors over the frames of a speaker; the hidden vectors produced from all the frames are assumed to be equally important. We relax this assumption and compute the speaker embedding as a weighted average of a speaker's frame-level hidden vectors, and their weights are auto…
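
The weighted-average idea the abstract describes maps onto a small pooling layer. Below is a minimal PyTorch sketch of self-attentive pooling; the layer sizes and names (`hidden_dim`, `attn_dim`) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Compute a speaker embedding as an attention-weighted average of
    frame-level hidden vectors (sketch; sizes are illustrative)."""
    def __init__(self, hidden_dim=512, attn_dim=64):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)       # W, b
        self.score = nn.Linear(attn_dim, 1, bias=False)   # v

    def forward(self, h):
        # h: (batch, frames, hidden_dim) frame-level hidden vectors
        e = self.score(torch.tanh(self.proj(h)))          # (batch, frames, 1)
        alpha = F.softmax(e, dim=1)                       # weights sum to 1 over frames
        return (alpha * h).sum(dim=1)                     # (batch, hidden_dim)
```

Setting every weight to 1/frames recovers the equal-importance average that the paper relaxes.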


Cited by 228 publications (185 citation statements)
References 14 publications (29 reference statements)
“…Attention mechanisms have led to significant advances across computer vision, spoken language understanding and natural language processing, increasing the modelling capacity of deep neural networks by concentrating on crucial features and suppressing unimportant ones. For speaker recognition, [9,10] utilize self-attention for aggregating frame-level features. [11] combined an attention mechanism with statistics pooling [5] to propose attentive statistics pooling.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
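
For context on the attentive statistics pooling of [11] mentioned above, here is a hedged PyTorch sketch: attention weights produce a weighted mean and a weighted standard deviation, which are concatenated. The parameterization and sizes are assumptions, not the exact model of [11].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveStatisticsPooling(nn.Module):
    """Attention-weighted mean and standard deviation over frames,
    in the spirit of attentive statistics pooling (sketch)."""
    def __init__(self, feat_dim=512, attn_dim=64):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(),
            nn.Linear(attn_dim, 1, bias=False),
        )

    def forward(self, h):                              # h: (batch, frames, feat_dim)
        alpha = F.softmax(self.attention(h), dim=1)    # (batch, frames, 1)
        mean = (alpha * h).sum(dim=1)                  # weighted first moment
        var = (alpha * h.pow(2)).sum(dim=1) - mean.pow(2)
        std = var.clamp(min=1e-8).sqrt()               # floor avoids sqrt of tiny negatives
        return torch.cat([mean, std], dim=1)           # (batch, 2 * feat_dim)
```
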
“…[7] proposed the usage of dictionary-based NetVLAD or GhostVLAD [8] for aggregating temporal features, using a 34-layer ResNet-based front-end for feature extraction. Numerous recent works [9,10,11,12] have proposed attention-based techniques for aggregating frame-level feature descriptors, to assign greater importance to the more discriminative frames.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
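
The dictionary-based aggregation of [7,8] can be sketched as follows; this is a simplified NetVLAD layer with an assumed cluster count and normalization choices, not the cited implementation. GhostVLAD differs in adding "ghost" clusters whose aggregated residuals are dropped from the output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Soft-assign frame descriptors to learned clusters and aggregate
    the residuals to the cluster centroids (simplified sketch)."""
    def __init__(self, feat_dim=512, num_clusters=8):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim))
        self.assign = nn.Linear(feat_dim, num_clusters)

    def forward(self, x):                              # x: (batch, frames, feat_dim)
        a = F.softmax(self.assign(x), dim=-1)          # soft cluster assignments
        r = x.unsqueeze(2) - self.centroids[None, None]  # (batch, frames, K, D) residuals
        vlad = (a.unsqueeze(-1) * r).sum(dim=1)        # aggregate residuals per cluster
        vlad = F.normalize(vlad, dim=-1)               # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=-1)    # (batch, K * feat_dim)
```
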
“…It is now widely used for speaker recognition and is effective in speaker embedding extraction. The second baseline ("X-Vectors+Attention") is built by combining a global attention mechanism with a TDNN [13,14]. For evaluation on our speaker identification task, the correct prediction rate (prediction accuracy) is reported in this work.…”
Section: Experiments Setup (citation type: mentioning)
confidence: 99%
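
The TDNN front-end this excerpt refers to can be sketched with dilated 1-D convolutions. The context widths and dimensions below follow a common x-vector-style recipe but are assumptions, not the citing paper's exact baseline; an attention-pooling layer like the ones sketched above would sit on top to form the "X-Vectors+Attention" baseline.

```python
import torch
import torch.nn as nn

class TDNNFrontEnd(nn.Module):
    """Frame-level TDNN: each dilated Conv1d splices frames over a
    growing temporal context (sketch with illustrative widths)."""
    def __init__(self, in_dim=30, hid=512):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(in_dim, hid, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hid, hid, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hid, hid, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(hid, hid, kernel_size=1), nn.ReLU(),
        )

    def forward(self, feats):                      # feats: (batch, frames, in_dim)
        h = self.layers(feats.transpose(1, 2))     # Conv1d expects (batch, dim, frames)
        return h.transpose(1, 2)                   # (batch, frames', hid)
```
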
“…David et al. [12] used a five-layer DNN that takes a small temporal context into account, together with statistics pooling. To further improve performance in embedding generation, attention mechanisms have also been used in some recent studies [13,14]. Wang et al. [13] used an attentive X-vector, where a self-attention layer was added before the statistics pooling layer to weight each frame.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
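
For contrast with the attentive variants sketched above, the plain statistics pooling used in the x-vector line of work [12] weights every frame equally. A minimal sketch:

```python
import torch

def statistics_pooling(h, eps=1e-8):
    """Unweighted statistics pooling: concatenate per-utterance mean and
    standard deviation of frame-level features h: (batch, frames, dim)."""
    mean = h.mean(dim=1)
    std = h.var(dim=1, unbiased=False).clamp(min=eps).sqrt()
    return torch.cat([mean, std], dim=1)   # (batch, 2 * dim)
```
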
“…Self-attention and i-vector-based attention represent two kinds of algorithms in the SV field [9,10]. In single-head self-attention [9], Eq. (1) can be written as:…”
Section: Attentive Statistics Pooling (citation type: mentioning)
confidence: 99%
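
The excerpt truncates before the cited Eq. (1). For orientation, the standard single-head self-attentive pooling used in this line of work takes the form below; the notation is mine and may differ from the citing paper's.

```latex
% Single-head self-attentive pooling over frame-level hidden vectors h_t
e_t = \mathbf{v}^\top \tanh(\mathbf{W}\mathbf{h}_t + \mathbf{b}), \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)}, \qquad
\tilde{\mathbf{e}} = \sum_{t=1}^{T} \alpha_t \mathbf{h}_t
```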