Interspeech 2018
DOI: 10.21437/interspeech.2018-1515

An Improved Deep Embedding Learning Method for Short Duration Speaker Verification

Abstract: The version in the Kent Academic Repository may differ from the final published version. Users are advised to check http://kar.kent.ac.uk for the status of the paper and should always cite the published version of record.

Cited by 30 publications (30 citation statements: 0 supporting, 30 mentioning, 0 contrasting) · References 13 publications · Citing years: 2019–2023
“…Being able to benefit from a discriminative training process, deep embedding methods such as d-vector or x-vector have been shown to outperform traditional i-vectors [1,2], especially for short duration utterances. Existing deep embedding learning architectures include time-delay DNN (TDNN) [2], convolutional neural network (CNN) [3,4], and Long Short-Term Memory Network (LSTM) [5]. They generally consist of three main components [6,7]: (1) Frame-level feature processing to model local short spans of acoustic features via TDNN or convolutional layers.…”
Section: Introduction (mentioning; confidence: 99%)
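
As a concrete illustration of the frame-level processing this excerpt describes, below is a minimal PyTorch sketch of a TDNN-style stack of dilated 1-D convolutions over acoustic frames. The layer sizes, kernel widths, and dilations are illustrative assumptions, not the configuration of the cited paper.

import torch
import torch.nn as nn

# Frame-level TDNN sketch: each layer is a 1-D convolution over time,
# so deeper layers see a progressively wider temporal context.
class TDNNFrameLevel(nn.Module):
    def __init__(self, feat_dim=30, hidden_dim=512):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, dilation=3), nn.ReLU(),
        )

    def forward(self, x):      # x: (batch, feat_dim, n_frames)
        return self.layers(x)  # (batch, hidden_dim, fewer frames due to valid convolution)

frames = torch.randn(8, 30, 200)  # 8 utterances, 30-dim features, 200 frames
frame_level_out = TDNNFrameLevel()(frames)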
“…Many recent works have focused on utterance-level embedding learning, e.g., average pooling [1], statistical pooling [2], attentive pooling [13,14], cross-convolutional-layer pooling [3], learnable dictionary encoding (LDE) [12]. Besides cross entropy loss (CE), different loss functions have been recently proposed, including triplet loss [15,16], center loss [12,17], angular softmax (A-softmax) [12,18], additive margin softmax (AM-softmax) [19] and logistic margin (LM) [19].…”
Section: Introduction (mentioning; confidence: 99%)
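
Since several of the listed loss functions recur throughout this literature, a short sketch of one of them, additive margin softmax (AM-softmax), may help: the margin m is subtracted from the target-class cosine similarity before scaling, which tightens same-speaker clusters. The scale and margin values below are common illustrative choices, not those of any cited paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

# AM-softmax sketch: cosine logits with an additive margin on the target class.
class AMSoftmaxLoss(nn.Module):
    def __init__(self, emb_dim, n_speakers, s=30.0, m=0.35):
        super().__init__()
        self.W = nn.Parameter(torch.randn(emb_dim, n_speakers))
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # cosine similarity between L2-normalised embeddings and class weights
        cos = F.normalize(emb, dim=1) @ F.normalize(self.W, dim=0)
        # subtract the margin m from the target-class cosine only
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (cos - self.m * onehot)
        return F.cross_entropy(logits, labels)

emb = torch.randn(16, 256)              # a batch of speaker embeddings
labels = torch.randint(0, 1000, (16,))  # speaker labels
loss = AMSoftmaxLoss(256, 1000)(emb, labels)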
“…A pooling layer follows to aggregate frame-level outputs, and fully-connected (FC) layers then map the aggregation to speaker embeddings. Average-pooling, max-pooling [10], statistics pooling [6], attentive pooling [11], and cross-layer bilinear pooling [12] are popular choices.…”
Section: Introduction (mentioning; confidence: 99%)
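
To make the pooling step concrete, here is a hedged sketch of one popular variant, attentive pooling, where a small network scores each frame and the utterance representation is the attention-weighted mean. The scoring network below is a minimal assumption; published variants differ (e.g., attentive statistics pooling also computes a weighted standard deviation).

import torch
import torch.nn as nn
import torch.nn.functional as F

# Attentive pooling sketch: learn a per-frame score, softmax over time,
# then take the weighted mean of the frame-level outputs.
class AttentivePooling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, h):                   # h: (batch, n_frames, dim)
        w = F.softmax(self.attn(h), dim=1)  # attention weights: (batch, n_frames, 1)
        return (w * h).sum(dim=1)           # utterance-level vector: (batch, dim)

h = torch.randn(4, 186, 512)        # frame-level outputs
utt = AttentivePooling(512)(h)      # fixed-dimensional utterance representation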
“…With the great success of deep neural networks (DNNs) in machine learning fields, more attention has been drawn to the use of DNNs to extract i-vector similar vectors, known as speaker embeddings. Many novel DNN embedding-based systems have been proposed, and they have achieved comparable or even better performance compared with the traditional i-vector paradigm [3,4,5,6,7,8,9,10].…”
Section: Introduction (mentioning; confidence: 99%)
“…In most DNN embedding systems [5,7,8,9,10], an input utterance with a variable length is first fed into several frame-level layers to obtain high-level feature representations. The frame-level layers are usually modeled by recurrent neural networks (RNNs) [9], convolution neural networks (CNNs) [7,10] or time-delay neural networks (TDNNs) [5,8]. Next, a pooling layer maps all frames of the input utterance into a fixed-dimensionality vector, and the speaker embedding is generated from the following stacked fully connected layers.…”
Section: Introduction (mentioning; confidence: 99%)
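
The three-stage pipeline this excerpt walks through (frame-level layers, pooling, fully connected layers) can be sketched end to end as follows. Statistics pooling (mean and standard deviation over time) is used here, and all layer sizes are placeholders rather than a specific published configuration.

import torch
import torch.nn as nn

# Generic DNN speaker-embedding pipeline:
# frame-level layers -> statistics pooling -> FC layers -> embedding.
class SpeakerEmbeddingNet(nn.Module):
    def __init__(self, feat_dim=30, hidden=512, emb_dim=256):
        super().__init__()
        self.frame = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, x):  # x: (batch, feat_dim, n_frames), any frame count
        h = self.frame(x)  # frame-level representations
        # statistics pooling: concatenate per-channel mean and std over time
        stats = torch.cat([h.mean(dim=-1), h.std(dim=-1)], dim=1)
        return self.fc(stats)  # fixed-dimensional speaker embedding

emb = SpeakerEmbeddingNet()(torch.randn(4, 30, 300))  # (4, 256) regardless of length

Because pooling collapses the time axis, the same network handles utterances of any duration, which is precisely what makes this family of models attractive for the short-duration conditions the paper targets.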