Interspeech 2018
DOI: 10.21437/interspeech.2018-1515

An Improved Deep Embedding Learning Method for Short Duration Speaker Verification

Abstract: The version in the Kent Academic Repository may differ from the final published version. Users are advised to check http://kar.kent.ac.uk for the status of the paper and should always cite the published version of record.

Cited by 30 publications (30 citation statements: 0 supporting, 30 mentioning, 0 contrasting) · References 13 publications · Citing years: 2019–2023
“…Being able to benefit from a discriminative training process, deep embedding methods such as d-vector or x-vector have been shown to outperform traditional i-vectors [1,2], especially for short duration utterances. Existing deep embedding learning architectures include time-delay DNN (TDNN) [2], convolutional neural network (CNN) [3,4], and Long Short-Term Memory Network (LSTM) [5]. They generally consist of three main components [6,7]: (1) Frame-level feature processing to model local short spans of acoustic features via TDNN or convolutional layers.…”
Section: Introduction (mentioning; confidence: 99%)
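
As a concrete illustration of the frame-level processing this excerpt describes, below is a minimal PyTorch sketch of a TDNN-style stack of dilated 1-D convolutions over acoustic frames. The layer sizes, kernel widths, and dilations are illustrative assumptions, not the configuration of the cited paper.

import torch
import torch.nn as nn

# Frame-level TDNN sketch: each layer is a 1-D convolution over time,
# so deeper layers see a progressively wider temporal context.
class TDNNFrameLevel(nn.Module):
    def __init__(self, feat_dim=30, hidden_dim=512):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=3), nn.ReLU(),
        )

    def forward(self, x):      # x: (batch, feat_dim, n_frames)
        return self.layers(x)  # (batch, hidden_dim, fewer frames due to valid convolution)

frames = torch.randn(8, 30, 200)  # 8 utterances, 30-dim features, 200 frames
frame_level_out = TDNNFrameLevel()(frames)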
“…Many recent works have focused on utterance-level embedding learning, e.g., average pooling [1], statistical pooling [2], attentive pooling [13,14], cross-convolutional-layer pooling [3], learnable dictionary encoding (LDE) [12]. Besides cross entropy loss (CE), different loss functions have been recently proposed, including triplet loss [15,16], center loss [12,17], angular softmax (A-softmax) [12,18], additive margin softmax (AM-softmax) [19] and logistic margin (LM) [19].…”
Section: Introduction (mentioning; confidence: 99%)
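
Since several of the listed loss functions recur throughout this literature, a short sketch of one of them, additive margin softmax (AM-softmax), may help: the margin m is subtracted from the target-class cosine similarity before scaling, which tightens same-speaker clusters. The scale and margin values below are common illustrative choices, not those of any cited paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

# AM-softmax sketch: cosine logits with an additive margin on the target class.
class AMSoftmaxLoss(nn.Module):
    def __init__(self, emb_dim, n_speakers, s=30.0, m=0.35):
        super().__init__()
        self.W = nn.Parameter(torch.randn(emb_dim, n_speakers))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # cosine similarity between L2-normalised embeddings and class weights
        cos = F.normalize(emb, dim=1) @ F.normalize(self.W, dim=0)
        # subtract the margin m from the target-class cosine only
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (cos - self.m * onehot)
        return F.cross_entropy(logits, labels)

emb = torch.randn(16, 256)              # a batch of speaker embeddings
labels = torch.randint(0, 1000, (16,))  # speaker labels
loss = AMSoftmaxLoss(256, 1000)(emb, labels)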
“…A pooling layer follows to aggregate frame-level outputs, and fully-connected (FC) layers then map the aggregation to speaker embeddings. Average-pooling, max-pooling [10], statistics pooling [6], attentive pooling [11], and cross-layer bilinear pooling [12] are popular choices.…”
Section: Introduction (mentioning; confidence: 99%)
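
To make the pooling step concrete, here is a hedged sketch of one popular variant, attentive pooling, where a small network scores each frame and the utterance representation is the attention-weighted mean. The scoring network below is a minimal assumption; published variants differ (e.g., attentive statistics pooling also computes a weighted standard deviation).

import torch
import torch.nn as nn
import torch.nn.functional as F

# Attentive pooling sketch: learn a per-frame score, softmax over time,
# then take the weighted mean of the frame-level outputs.
class AttentivePooling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, h):                   # h: (batch, n_frames, dim)
        w = F.softmax(self.attn(h), dim=1)  # attention weights: (batch, n_frames, 1)
        return (w * h).sum(dim=1)           # utterance-level vector: (batch, dim)

h = torch.randn(4, 186, 512)        # frame-level outputs
utt = AttentivePooling(512)(h)      # fixed-dimensional utterance representation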
“…With the great success of deep neural networks (DNNs) in machine learning fields, more attention has been drawn to the use of DNNs to extract i-vector similar vectors, known as speaker embeddings. Many novel DNN embedding-based systems have been proposed, and they have achieved comparable or even better performance compared with the traditional i-vector paradigm [3,4,5,6,7,8,9,10].…”
Section: Introduction (mentioning; confidence: 99%)
“…In most DNN embedding systems [5,7,8,9,10], an input utterance with a variable length is first fed into several frame-level layers to obtain high-level feature representations. The frame-level layers are usually modeled by recurrent neural networks (RNNs) [9], convolution neural networks (CNNs) [7,10] or time-delay neural networks (TDNNs) [5,8]. Next, a pooling layer maps all frames of the input utterance into a fixed-dimensionality vector, and the speaker embedding is generated from the following stacked fully connected layers.…”
Section: Introduction (mentioning; confidence: 99%)
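
The three-stage pipeline this excerpt walks through (frame-level layers, pooling, fully connected layers) can be sketched end to end as follows. Statistics pooling (mean and standard deviation over time) is used here, and all layer sizes are placeholders rather than a specific published configuration.

import torch
import torch.nn as nn

# Generic DNN speaker-embedding pipeline:
# frame-level layers -> statistics pooling -> FC layers -> embedding.
class SpeakerEmbeddingNet(nn.Module):
    def __init__(self, feat_dim=30, hidden=512, emb_dim=256):
        super().__init__()
        self.frame = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, x):  # x: (batch, feat_dim, n_frames), any frame count
        h = self.frame(x)  # frame-level representations
        # statistics pooling: concatenate per-channel mean and std over time
        stats = torch.cat([h.mean(dim=-1), h.std(dim=-1)], dim=1)
        return self.fc(stats)  # fixed-dimensional speaker embedding

emb = SpeakerEmbeddingNet()(torch.randn(4, 30, 300))  # (4, 256) regardless of length

Because pooling collapses the time axis, the same network handles utterances of any duration, which is precisely what makes this family of models attractive for the short-duration conditions the paper targets.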