ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8682712

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

Abstract: This paper aims to improve the widely used deep speaker embedding x-vector model. We propose the following improvements: (1) a hybrid neural network structure using both time delay neural network (TDNN) and long short-term memory neural networks (LSTM) to generate complementary speaker information at different levels; (2) a multi-level pooling strategy to collect speaker information from both TDNN and LSTM layers; (3) a regularization scheme on the speaker embedding extraction layer to make the extracted embed…
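
As a rough illustration of item (2), the sketch below pools frame-level outputs from two layers into one fixed-length vector. It assumes x-vector style statistics pooling (per-dimension mean and standard deviation over time) and assumes the two pooled vectors are simply concatenated; the abstract does not specify the combination rule, and the names `stats_pool` and `multi_level_pool` are hypothetical, not taken from the paper.

```python
import math
from typing import List, Sequence

# A layer's output: T frames, each a D-dimensional feature vector.
Frames = Sequence[Sequence[float]]


def stats_pool(frames: Frames) -> List[float]:
    """Statistics pooling: per-dimension mean and (population) std over time."""
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


def multi_level_pool(tdnn_frames: Frames, lstm_frames: Frames) -> List[float]:
    """Assumed combination rule: concatenate the pooled statistics of both layers."""
    return stats_pool(tdnn_frames) + stats_pool(lstm_frames)


# Toy inputs standing in for TDNN (2-dim) and LSTM (1-dim) frame-level outputs.
tdnn_out = [[1.0, 2.0], [3.0, 4.0]]
lstm_out = [[0.0], [0.0], [3.0]]
utterance_vector = multi_level_pool(tdnn_out, lstm_out)
```

The two layers may have different frame counts and widths; each is pooled over its own time axis, so the result has a fixed length of twice the sum of the layer widths regardless of utterance duration.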

Cited by 71 publications (41 citation statements)
References 20 publications
“…Further experiments on WSJ and LibriSpeech show that our attention mechanism could achieve the best performance among all end-to-end methods without data augmentation, and it is only slightly worse than the state-of-the-art performance. In the future, we would study other ways of utilizing multi-level information [43].…”
Section: Results (mentioning)
confidence: 99%
“…Since the time delay neural network (TDNN) based xvector system was proposed [32], a series of similar improved networks have been investigated [18,19,25], and a variety of other neural network architectures, such as DenseNet, ResNet and InceptionNet, have also been used to extract more speaker-discriminative embeddings [21,23,24]. In order to obtain long-term speaker representation with more discriminative power, attention mechanism [17] is widely used recently.…”
Section: Introduction (mentioning)
confidence: 99%
“…On the other hand, during testing, if not enough speech segments are detected, then the SV algorithms will not be able to detect the speaker. For this reason, the VAD has played a vital role in robust SV systems from traditional Gaussian Mixture Model-Universal Background Model (GMM-UBM) and i-vector systems [1,2] to recent deep speaker embedding systems [3,4,5,6].…”
Section: Introduction (mentioning)
confidence: 99%