Interspeech 2017
DOI: 10.21437/interspeech.2017-1575

Deep Speaker Embeddings for Short-Duration Speaker Verification

Abstract: The performance of a state-of-the-art speaker verification system is severely degraded when it is presented with trial recordings of short duration. In this work we propose to use deep neural networks to learn short-duration speaker embeddings. We focus on the 5s-5s condition, wherein both sides of a verification trial are 5 seconds long. In our previous work we established that learning a non-linear mapping from i-vectors to speaker labels is beneficial for speaker verification [1]. In this work we take the i…
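The abstract describes learning a non-linear mapping from i-vectors to speaker labels and deriving an embedding from that network. As a rough, hypothetical sketch of the idea (the abstract shown here is truncated, so the depth, layer sizes, and names below are illustrative assumptions, not the authors' exact architecture):

# Hypothetical PyTorch sketch: feed-forward DNN trained to classify speakers
# from i-vectors; a hidden activation serves as the speaker embedding.
# All dimensions (600-d i-vectors, 512-d embedding, 5000 speakers) are assumed.
import torch
import torch.nn as nn

class IVectorEmbeddingNet(nn.Module):
    def __init__(self, ivector_dim=600, embed_dim=512, num_speakers=5000):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(ivector_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, ivec):
        emb = self.hidden(ivec)        # speaker embedding, used for trial scoring
        logits = self.classifier(emb)  # speaker posteriors, used only in training
        return emb, logits

model = IVectorEmbeddingNet()
emb, logits = model(torch.randn(8, 600))  # a batch of 8 i-vectors
loss = nn.functional.cross_entropy(logits, torch.randint(0, 5000, (8,)))
# At verification time, the embeddings of the two 5 s trial sides would be
# compared, e.g. with cosine similarity or PLDA.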

Cited by 132 publications (130 citation statements); references 7 publications.
“…Self-attentive pooling layer [20], learnable dictionary encoding layer [21], and dictionary-based NetVLAD layer [22,23] are other commonly used encoding layers. Once the utterance-level representation is extracted, a fully connected layer and a speaker classifier are employed to further abstract the speaker representation and classify the training speakers.…”
Section: Revisit: Deep Speaker Embedding
confidence: 99%
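For concreteness, here is a minimal, hypothetical sketch of the self-attentive pooling idea this snippet refers to ([20] in the citing paper): a small attention network scores each frame-level feature, and the utterance-level representation is the attention-weighted mean. The single-head form and all dimensions are assumptions, not the cited paper's exact configuration.

# Hypothetical sketch: single-head self-attentive pooling over frame features.
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    def __init__(self, feat_dim=256, attn_dim=128):
        super().__init__()
        self.w1 = nn.Linear(feat_dim, attn_dim)
        self.w2 = nn.Linear(attn_dim, 1)

    def forward(self, h):                          # h: (batch, frames, feat_dim)
        scores = self.w2(torch.tanh(self.w1(h)))   # (batch, frames, 1)
        alpha = torch.softmax(scores, dim=1)       # weights over the frame axis
        return (alpha * h).sum(dim=1)              # (batch, feat_dim) utterance vector

pool = SelfAttentivePooling()
utt = pool(torch.randn(4, 200, 256))  # 4 utterances of 200 frames each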
“…$\tilde{h}_t = h_t \odot r$ (5), where $\odot$ is the element-wise multiplication and $r \in \mathbb{R}^D$ is the i-vector transformed through a non-linear affine layer. Since each utterance owns an utterance-dependent i-vector, the attention weights are determined by both $h_t$ and the utterance-level information from the i-vector.…”
Section: Attentive Statistics Pooling
confidence: 99%
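Read literally, the snippet gates the frame-level features $h_t$ with a transformed i-vector $r$ so that the attention weights carry utterance-level information. The following is one plausible, hypothetical rendering of such attentive statistics pooling; the exact placement of the gating and all dimensions are assumptions.

# Hypothetical sketch: attentive statistics pooling whose attention depends on
# both frame features h_t and an i-vector r via the element-wise product h_t * r.
import torch
import torch.nn as nn

class IVectorAttentiveStatsPooling(nn.Module):
    def __init__(self, feat_dim=256, ivector_dim=600, attn_dim=128):
        super().__init__()
        self.ivec_proj = nn.Sequential(nn.Linear(ivector_dim, feat_dim), nn.Tanh())
        self.w1 = nn.Linear(feat_dim, attn_dim)
        self.w2 = nn.Linear(attn_dim, 1)

    def forward(self, h, ivec):                    # h: (B, T, D); ivec: (B, ivec_dim)
        r = self.ivec_proj(ivec).unsqueeze(1)      # (B, 1, D), non-linear affine map
        gated = h * r                              # element-wise product h_t ⊙ r
        alpha = torch.softmax(self.w2(torch.tanh(self.w1(gated))), dim=1)
        mean = (alpha * h).sum(dim=1)              # attention-weighted mean
        var = (alpha * h.pow(2)).sum(dim=1) - mean.pow(2)
        std = var.clamp(min=1e-8).sqrt()           # attention-weighted std deviation
        return torch.cat([mean, std], dim=1)       # (B, 2*D) utterance statistics

pool = IVectorAttentiveStatsPooling()
stats = pool(torch.randn(4, 200, 256), torch.randn(4, 600))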
“…With the great success of deep learning across a wide range of machine learning tasks, more effort has been focused on using deep neural networks (DNNs) to extract more discriminative speaker representations [3,4,5,6]. These deep speaker embedding systems can achieve comparable or even better performance than i-vector based methods, particularly on short-duration utterances.…”
Section: Introduction
confidence: 99%
“…The state-of-the-art text-independent speaker verification systems [1][2][3][4] use deep neural networks (DNNs) to project speech recordings of different lengths into a common low-dimensional embedding space where the speakers' identities are represented. Such a method is called deep embedding, where the embedding networks have three key components: network structure [1,3,[5][6][7], pooling layer [1,[8][9][10][11][12], and loss function [13][14][15][16][17]. This paper focuses on the last part, i.e., the loss functions.…”
Section: Introduction
confidence: 99%
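Since this last snippet singles out the loss function as its focus, a minimal sketch of one widely used member of that family, additive-margin softmax, may help. It is offered as a representative example, not necessarily the specific loss studied in [13][14][15][16][17], and the scale s and margin m values are illustrative.

# Hypothetical sketch: additive-margin softmax (AM-softmax), a common
# speaker-embedding training loss. s and m are illustrative values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    def __init__(self, embed_dim=512, num_speakers=5000, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalised embeddings and class weights.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        # Subtract the margin from the target-class cosine only, then rescale.
        onehot = F.one_hot(labels, cos.size(1)).to(cos.dtype)
        logits = self.s * (cos - self.m * onehot)
        return F.cross_entropy(logits, labels)

criterion = AMSoftmaxLoss()
loss = criterion(torch.randn(8, 512), torch.randint(0, 5000, (8,)))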