Self-Attention Encoding and Pooling for Speaker Recognition

Safari, Pooyan; India, Miquel; Hernando, Javier

doi:10.21437/interspeech.2020-1446

Cited by 52 publications

(35 citation statements)

References 0 publications


“…Safari et al. [20] proposed a self-attention pooling layer and showed it is good at extracting the time-invariant features information. We utilize the self-attention pooling layer to extract the representation from the target encoder.…”

Section: Self-attention Pooling (mentioning)

confidence: 99%

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

Lin, Lin, Chien, et al. 2021

Preprint


Any-to-any voice conversion (VC) aims to convert the timbre of utterances from and to any speakers seen or unseen during training. Various any-to-any VC approaches have been proposed, like AUTOVC, AdaINVC, and FragmentVC. AUTOVC and AdaINVC utilize source and target encoders to disentangle the content and speaker information of the features. FragmentVC utilizes two encoders to encode source and target information and adopts cross attention to align the source and target features with similar phonetic content. Moreover, pretrained features are adopted. AUTOVC used d-vector to extract speaker information, and self-supervised learning (SSL) features like wav2vec 2.0 are used in FragmentVC to extract the phonetic content information. Different from previous works, we proposed S2VC that utilizes Self-Supervised features as both source and target features for the VC model. Supervised phoneme posteriorgram (PPG), which is believed to be speaker-independent and widely used in VC to extract content information, is chosen as a strong baseline for SSL features. The objective evaluation and subjective evaluation both show that models taking the SSL feature CPC as both source and target features outperform those taking PPG as the source feature, suggesting that SSL features have great potential in improving VC.



“…To design the lightweight models, Nunes et al. [41] proposed a portable model called additive margin MobileNet1D (AM-MobileNet1D) for speaker identification on mobile devices, which uses raw waveform of speeches as input. Safari et al. [42] presented a deep speaker embedding architecture based on a self-attention encoding and pooling (SAEP) mechanism, which outperforms x-vector [5] with fewer parameters. In this paper, we construct the SV model via two specific lightweight techniques: depthwise separable convolution for reducing the parameters of convolutional layers and low-rank matrix factorization for decreasing the parameters of fully connected layers.…”

Section: Lightweight Architectures for TI-SV (mentioning)

confidence: 99%

RSKNet-MTSP: Effective and Portable Deep Architecture for Speaker Verification

Wu, Guo, Zhao, et al. 2021

Preprint


“…Lately, there have been several architectures proposed to encode audio utterances into speaker embeddings for different choices of network inputs, such as [5,6,7,8,9]. Using Mel-Frequency Cepstral Coefficient (MFCC) features, Time Delay Neural Network (TDNN) [5,6] is the most currently used architecture. [This work was supported by the Spanish project PID2019-107579RB-I00 / AEI / 10.13039/501100011033.]…”

Section: Introduction (mentioning)

confidence: 99%

“…2-D CNNs have also shown competitive results for speaker verification. There are Computer Vision architectures such as VGG [10,7,11,9] and ResNet [8,12,13] that have been adapted to capture speaker discriminative information from the Mel-Spectrograms. In fact, Resnet34 has shown a better performance than TDNN in the most recent speaker verification challenges [14,15].…”

Section: Introduction (mentioning)

confidence: 99%

Double Multi-Head Attention for Speaker Verification

India, Safari, Hernando. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Most state-of-the-art Deep Learning systems for text-independent speaker verification are based on speaker embedding extractors. These architectures are commonly composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. In this paper we present Double Multi-Head Attention (MHA) pooling, which extends our previous approach based on Self MHA. An additional self attention layer is added to the pooling layer that summarizes the context vectors produced by MHA into a unique speaker representation. This method enhances the pooling mechanism by giving weights to the information captured for each head and it results in creating more discriminative speaker embeddings. We have evaluated our approach with the VoxCeleb2 dataset. Our results show 6.09% and 5.23% relative improvement in terms of EER compared to Self Attention pooling and Self MHA, respectively. According to the obtained results, Double MHA has shown to be an excellent approach to efficiently select the most relevant features captured by the CNN-based front-ends from the speech signal.
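The pooling idea described in this abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name, the single learned query vector per head, the scaling by the square root of the head size, and all shapes are assumptions made for illustration. Frame-level features are split into heads, each head is pooled over time with its own attention weights, and a second attention layer then weights the per-head context vectors into one fixed-length vector.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def double_mha_pooling(frames, head_queries, pool_query):
    """Hypothetical double multi-head attention pooling.

    frames:       (T, D) frame-level features from a front-end
    head_queries: (K, D // K) one assumed trainable query per head
    pool_query:   (D // K,) assumed trainable query for the second layer
    """
    T, D = frames.shape
    K, d = head_queries.shape
    assert K * d == D, "feature size must split evenly across heads"

    # Split every frame into K sub-vectors of size d.
    heads = frames.reshape(T, K, d)

    # First attention: per-head weights over time, then weighted sums.
    scores = np.einsum("tkd,kd->tk", heads, head_queries) / np.sqrt(d)
    frame_weights = softmax(scores, axis=0)                    # (T, K)
    contexts = np.einsum("tk,tkd->kd", frame_weights, heads)   # (K, d)

    # Second attention: weights over the K head context vectors.
    head_weights = softmax(contexts @ pool_query / np.sqrt(d), axis=0)  # (K,)
    embedding = head_weights @ contexts                        # (d,)
    return embedding, frame_weights, head_weights


rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
pool = rng.normal(size=8)
for num_frames in (10, 57):
    emb, _, _ = double_mha_pooling(rng.normal(size=(num_frames, 32)), queries, pool)
    print(num_frames, emb.shape)
```

Under these assumptions the output size depends only on the head dimension, so utterances of different lengths map to vectors of the same length.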

