Interspeech 2018
DOI: 10.21437/interspeech.2018-993

Attentive Statistics Pooling for Deep Speaker Embedding

Abstract: This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an utterance-level feature. Our method utilizes an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations. In this way, it can capture long-term variations in speaker characteristics more eff…
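The abstract describes the pooling operation at a high level. As a minimal sketch of the idea (not the authors' reference implementation), the following PyTorch-style module computes one attention weight per frame and uses it to form the weighted mean and weighted standard deviation; the module name, hidden size, and the (batch, channels, frames) tensor layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and standard deviation over frames.

    Input:  x of shape (batch, channels, frames) -- frame-level features.
    Output: (batch, 2 * channels) utterance-level vector.
    """
    def __init__(self, channels, attn_hidden=128):
        super().__init__()
        # Small network that scores each frame with a single scalar.
        self.attention = nn.Sequential(
            nn.Conv1d(channels, attn_hidden, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(attn_hidden, 1, kernel_size=1),
        )

    def forward(self, x):
        # alpha: (batch, 1, frames), normalized over the frame axis.
        alpha = torch.softmax(self.attention(x), dim=2)
        mean = torch.sum(alpha * x, dim=2)                   # weighted mean
        var = torch.sum(alpha * x * x, dim=2) - mean * mean  # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))                # weighted std
        return torch.cat([mean, std], dim=1)
```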

Help me understand this report
View preprint versions

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
2
1

Citation Types

1
266
0

Year Published

2020
2020
2023
2023

Publication Types

Select...
5
3

Relationship

0
8

Authors

Journals

Cited by 391 publications (267 citation statements)
References 21 publications (49 reference statements)

Citation statements (ordered by relevance):

“…For speaker recognition, [9,10] utilize self-attention for aggregating frame-level features. [11] combined attention mechanism with statistics pooling [5] to propose attentive-statistics pooling. Most recently, [12] employ the idea of multi-head attention [14] for feature aggregation, outperforming an I-vector+PLDA baseline by 58% (relative).…”
Section: Related Work
confidence: 99%
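The multi-head aggregation mentioned above can take several forms; one common formulation, sketched here as an assumption rather than the cited papers' exact method, gives each head its own frame weights and concatenates the per-head weighted means.

```python
import torch
import torch.nn as nn

class MultiHeadAttentivePooling(nn.Module):
    """Each head learns its own frame weights and contributes a weighted mean."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        # One scalar score per frame and per head.
        self.score = nn.Conv1d(channels, num_heads, kernel_size=1)

    def forward(self, x):                        # x: (batch, channels, frames)
        w = torch.softmax(self.score(x), dim=2)  # (batch, heads, frames)
        # Weighted mean per head: (batch, heads, channels)
        pooled = torch.einsum('bht,bct->bhc', w, x)
        return pooled.flatten(start_dim=1)       # (batch, heads * channels)
```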
“…[7] proposed the usage of dictionary-based NetVLAD or GhostVLAD [8] for aggregating temporal features, using a 34-layer ResNet-based front-end for feature extraction. Numerous recent works [9,10,11,12] have proposed attention-based techniques for aggregation of frame-level feature descriptors, to assign greater importance to the more discriminative frames.…”
Section: Introduction
confidence: 99%
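For contrast with attention-based pooling, a rough sketch of the dictionary-based aggregation mentioned above (NetVLAD with ghost clusters, i.e. GhostVLAD) could look like the following; the cluster counts, centroid initialization, and normalization details are assumptions, not the cited papers' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostVLAD(nn.Module):
    """NetVLAD-style aggregation with ghost clusters that absorb
    uninformative frames and are dropped from the output."""
    def __init__(self, channels, num_clusters=8, ghost_clusters=2):
        super().__init__()
        total = num_clusters + ghost_clusters
        self.num_clusters = num_clusters
        self.assign = nn.Conv1d(channels, total, kernel_size=1)  # soft assignment
        self.centroids = nn.Parameter(torch.randn(total, channels))

    def forward(self, x):                          # x: (batch, channels, frames)
        a = torch.softmax(self.assign(x), dim=1)   # assignment over clusters
        # Assignment-weighted residuals of each frame to each centroid:
        # vlad[b, k, c] = sum_t a[b, k, t] * (x[b, c, t] - centroid[k, c])
        vlad = torch.einsum('bkt,bct->bkc', a, x) \
             - a.sum(dim=2).unsqueeze(-1) * self.centroids
        vlad = vlad[:, :self.num_clusters]         # drop the ghost clusters
        vlad = F.normalize(vlad, dim=2)            # intra-cluster normalization
        return F.normalize(vlad.flatten(1), dim=1)
```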
“…Instead of using the stats pooling that the original architecture used, attentive stats pooling [17] was used, with 128 hidden units in the single attention head for the VoxCeleb system, and 64 for the CALLHOME system. After pooling, the VoxCeleb system was projected to an embedding of size 512, and CALLHOME to a 128-dimension embedding.…”
Section: Baselines
confidence: 99%
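Assuming the AttentiveStatsPooling module sketched after the abstract and an illustrative encoder width, the VoxCeleb configuration described in this statement (a single attention head with 128 hidden units, followed by a 512-dimensional embedding) could be wired roughly as follows; the channel count and batch shapes are made up for the example.

```python
import torch
import torch.nn as nn
# Relies on the AttentiveStatsPooling module sketched after the abstract above.

frame_channels = 1500                          # illustrative encoder output width
pooling = AttentiveStatsPooling(frame_channels, attn_hidden=128)  # 128 hidden units
embed = nn.Linear(2 * frame_channels, 512)     # mean + std concatenated -> 512-dim embedding

features = torch.randn(8, frame_channels, 300)  # (batch, channels, frames)
embedding = embed(pooling(features))            # shape: (8, 512)
```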
“…The improvement of our baseline over the Kaldi baseline for cosine similarity is likely due to the use of attentive statistics pooling and the angular penalty softmax. The most comparable network architecture in the literature is that of Okabe et al [17], which achieves an EER of 3.8% on VoxCeleb. In the recent VoxSRC 4 competition, much lower values for EER on VoxCeleb 1 were achieved (< 2%), generally using much deeper models and also with higher dimension inputs.…”
Section: DER Baseline (Kaldi)
confidence: 99%
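Cosine-similarity scoring of embeddings, as used for the baseline discussed above, reduces to the following; the decision threshold shown is purely illustrative and would in practice be tuned on a development set (the equal-error-rate operating point is one common choice).

```python
import torch
import torch.nn.functional as F

def cosine_score(enroll, test):
    """Cosine-similarity verification score between two speaker embeddings."""
    return F.cosine_similarity(enroll, test, dim=-1)

# Illustrative trial: accept the same-speaker hypothesis above a threshold.
score = cosine_score(torch.randn(512), torch.randn(512))
same_speaker = score > 0.3   # the 0.3 threshold is purely illustrative
```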
“…In order to obtain long-term speaker representation with more discriminative power, attention mechanism [17] is widely used recently. In [27], attentive statistics pooling was proposed to replace the conventional statistics pooling. In [10], multi-head self-attention mechanism was applied.…”
Section: Introduction
confidence: 99%
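For reference, the conventional statistics pooling that attentive statistics pooling replaces uses uniform frame weights; a minimal sketch, with the same tensor layout assumed in the earlier examples:

```python
import torch

def statistics_pooling(x):
    """Conventional (unweighted) statistics pooling over frames.

    x: (batch, channels, frames) -> (batch, 2 * channels).
    Attentive statistics pooling replaces the uniform frame weights implied
    here with learned, frame-dependent attention weights.
    """
    return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)
```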