Interspeech 2019
DOI: 10.21437/interspeech.2019-2195

Shortcut Connections Based Deep Speaker Embeddings for End-to-End Speaker Verification System

Cited by 22 publications (31 citation statements). References 0 publications.
“…For all the pooling methods mentioned above, we use only a single-scale feature map from the last layer of the feature extractor. Recently, multi-scale aggregation (MSA) methods have been proposed to exploit speaker information at multiple time scales [22], [23], [48], [49], showing their effectiveness in dealing with variable-duration test utterances.…”
Section: Deep Speaker Embedding Learning (mentioning)
confidence: 99%
“…Even with this robustness, using multi-scale features from multiple layers (Fig. 2(b)), called multi-scale aggregation (MSA), has shown better performance than using single-scale feature maps [22], [23], [48], [49]. Note that, between the frame- and segment-level operations, we should choose the segment-level operation for the MSA, because all the feature maps from different layers have the same time scale in the frame-level operation.…”
Section: Multi-scale Aggregation (mentioning)
confidence: 99%
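The statement above describes segment-level MSA: feature maps tapped from several layers are each pooled over time and then combined before the embedding layer. The following is a minimal PyTorch sketch of that idea only; the class name, the use of mean/std statistics pooling, and the channel and embedding sizes are illustrative assumptions, not the configuration of the cited papers.

```python
import torch
import torch.nn as nn


class MultiScaleAggregation(nn.Module):
    """Sketch of segment-level multi-scale aggregation: pool feature maps
    from several tapped layers over time, concatenate the pooled statistics,
    and project to a fixed-size speaker embedding."""

    def __init__(self, channels=(256, 256, 256), embed_dim=512):
        super().__init__()
        # mean + std per tapped layer -> 2 * sum(channels) input features
        self.embed = nn.Linear(2 * sum(channels), embed_dim)

    @staticmethod
    def stats_pool(x):
        # x: (batch, channels, frames) -> (batch, 2 * channels)
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

    def forward(self, feature_maps):
        # feature_maps: list of (batch, C_i, T_i) tensors from different layers.
        # Pooling over time removes the (possibly different) time scales,
        # which is why the aggregation is done at the segment level.
        pooled = [self.stats_pool(f) for f in feature_maps]
        return self.embed(torch.cat(pooled, dim=1))


# toy usage: three tapped layers with different temporal resolutions
msa = MultiScaleAggregation()
maps = [torch.randn(4, 256, t) for t in (300, 150, 75)]
embedding = msa(maps)  # shape (4, 512)
```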
“…To address this problem, several studies have applied a pooling layer or a temporal average layer to an end-to-end system [2,3]. The second is a speaker embedding-based system [4][5][6][7][8][9][10][11][12][13][14], which converts an input of variable length into a vector of fixed length using a DNN. The generated vector is used as an embedding to represent the speaker.…”
Section: Introduction (mentioning)
confidence: 99%
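The quoted passage refers to turning a variable-length utterance into a fixed-length embedding via a temporal pooling layer. Below is a minimal sketch of that mechanism, assuming a small Conv1d frame-level front end and average pooling over time; the layer sizes and feature dimension are hypothetical.

```python
import torch
import torch.nn as nn


class AveragePoolEmbedder(nn.Module):
    """Toy embedding extractor: frame-level layers followed by temporal
    average pooling, so utterances of any length map to a fixed-size vector."""

    def __init__(self, feat_dim=40, embed_dim=256):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.embed = nn.Linear(512, embed_dim)

    def forward(self, feats):
        # feats: (batch, feat_dim, frames); the number of frames may vary
        h = self.frame_layers(feats)
        h = h.mean(dim=2)        # temporal average pooling -> (batch, 512)
        return self.embed(h)     # fixed-length speaker embedding


emb = AveragePoolEmbedder()
short_utt = emb(torch.randn(1, 40, 200))    # short utterance
long_utt = emb(torch.randn(1, 40, 1000))    # long utterance
assert short_utt.shape == long_utt.shape    # both (1, 256)
```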
“…In addition, back-end methods, for example probabilistic linear discriminant analysis (PLDA), can be used [8][9][10]. The most important part of the above system is the speaker embedding generation [13]. A speaker embedding is a high-dimensional feature vector that contains speaker information.…”
(mentioning)
confidence: 99%
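The back-end named in the quote is PLDA; as a simpler stand-in that is also common in speaker verification, the sketch below scores a trial with cosine similarity between an enrollment embedding and a test embedding. The threshold value is purely illustrative.

```python
import torch
import torch.nn.functional as F


def cosine_score(enroll_emb, test_emb):
    """Cosine-similarity back-end: a simpler alternative to PLDA for
    comparing two speaker embeddings. Higher score means the two
    utterances are more likely from the same speaker."""
    return F.cosine_similarity(enroll_emb, test_emb, dim=-1)


# toy trial: random vectors standing in for real extractor output
enroll = torch.randn(1, 256)
test = torch.randn(1, 256)
score = cosine_score(enroll, test)   # value in [-1, 1]
accept = score.item() > 0.5          # threshold tuned on development data
```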