Double Multi-Head Attention for Speaker Verification

India, Miquel; Safari, Pooyan; Hernando, Javier

doi:10.1109/icassp39728.2021.9414877

Cited by 13 publications

(10 citation statements)

References 23 publications

(35 reference statements)


“…The results of [17] experiments are shown in Table 1. Performance was evaluated using Equal Error Rate (EER).…”

Section: Results (mentioning)

confidence: 99%
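For readers unfamiliar with the metric named in the statement above, the following is a minimal sketch of how an Equal Error Rate can be estimated from verification scores. It is not taken from [17] or from the citing paper. The function name `equal_error_rate`, the score lists, and the simple threshold sweep over observed scores are all illustrative assumptions; evaluation toolkits typically interpolate the ROC curve instead.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where the false acceptance
    rate and false rejection rate are closest.
    Hypothetical helper, for illustration only.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None  # (gap, eer, threshold)
    for t in thresholds:
        # False acceptance: a non-target (impostor) trial scored at or above t.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: a target (genuine) trial scored below t.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (made up for the example, not experimental data).
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
eer, threshold = equal_error_rate(targets, nontargets)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

A lower EER means the system separates same-speaker and different-speaker trials more cleanly; an EER of 0 means some threshold separates them perfectly.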

“…The DMHSA system has been assessed in [17] by VoxCeleb dataset [20,21]. VoxCeleb is a large multimedia database that contains more than one million 16kHz audio utterances for more than 6K celebrities and has two different versions with several evaluation protocols.…”

Section: Methods (mentioning)

confidence: 99%

“…VoxCeleb is a large multimedia database that contains more than one million 16kHz audio utterances for more than 6K celebrities and has two different versions with several evaluation protocols. For [17] experiments, VoxCeleb2 development partition with no augmentation has been used to train all models. The performance of these systems has been evaluated with Vox1-Test, Vox1-E, and Vox1-H conditions.…”

Section: Methods (mentioning)

confidence: 99%

“…In [17], DMHSA Pooling was proposed. It is a DL system with an attention-based pooling layer that was developed for SV.…”

Section: Double Multi-head Self-attention Pooling (mentioning)

confidence: 99%

“…The authors have recently proposed a DL system based on a Double Multi-Head Self-Attention (DMHSA) pooling [17]. Its architecture consists of a Convolutional Neural Network (CNN)-based front-end, followed by an attention-based pooling layer and a set of fully connected layers.…”

Section: Introduction (mentioning)

confidence: 99%


Speaker Characterization by means of Attention Pooling

Costa, India, Hernando

2022

IberSPEECH 2022


State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer to encode variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has been shown to be an excellent approach to efficiently select the most relevant features captured by the front-end from the speech signal. In this paper we show excellent experimental results by adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.
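The idea of a pooling layer that maps a variable-length sequence of front-end features to a fixed-length vector can be illustrated with a minimal sketch. The code below implements a plain single-head attention pooling, which is an assumption chosen for brevity: it is not the Double Multi-Head Self-Attention layer of the paper, whose details are not given in this record. The function name `attention_pool` and the `query` vector (standing in for a learned parameter) are hypothetical.

```python
import math


def attention_pool(frames, query):
    """Pool a list of frame vectors into one fixed-length vector.

    frames: list of T feature vectors, each of dimension D (T may vary).
    query:  a D-dimensional vector; in a trained model this would be a
            learned parameter (assumed here, for illustration only).
    Returns (pooled, weights): a D-dimensional weighted average of the
    frames and the T attention weights, which sum to 1.
    """
    # One relevance score per frame: dot product with the query.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in frames]
    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of frames: the output size depends only on D, not on T.
    dim = len(frames[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Two toy "utterances" of different lengths yield vectors of the same size.
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[0.5, 0.1], [0.2, 0.9], [1.5, 0.3], [0.7, 0.7]]
query = [1.0, -1.0]
print(len(attention_pool(short_utt, query)[0]))
print(len(attention_pool(long_utt, query)[0]))
```

With a zero query every frame receives the same weight and the layer reduces to average pooling; a non-zero query lets some frames contribute more than others.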

