Interspeech 2019
DOI: 10.21437/interspeech.2019-2240

A Deep Neural Network for Short-Segment Speaker Recognition

Abstract: Today's interactive devices, such as smartphone assistants and smart speakers, often deal with short-duration speech segments. As a result, speaker recognition systems integrated into such devices are better served by models capable of performing the recognition task on short-duration utterances. In this paper, a new deep neural network, UtterIdNet, capable of performing speaker recognition with short speech segments is proposed. Our proposed model utilizes a novel architecture that makes it suitab…

Cited by 58 publications (57 citation statements)
References 51 publications (121 reference statements)
“…For all the pooling methods mentioned above, we use only a single-scale feature map from the last layer of the feature extractor. Recently, multi-scale aggregation (MSA) methods have been proposed to exploit speaker information at multiple time scales [22], [23], [48], [49], showing their effectiveness in dealing with variable-duration test utterances.…”
Section: Deep Speaker Embedding Learning
Citation type: mentioning
confidence: 99%
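
The distinction this quote draws between a single-scale feature map and multi-scale aggregation is easy to see in code. Below is a minimal PyTorch sketch of frame-level MSA under assumed names and dimensions (nothing here is taken from [22], [23], [48], [49]): when all layers' feature maps share one time scale, they can be concatenated channel-wise and pooled once over time.

```python
# Illustrative frame-level MSA (assumed names and shapes): feature maps
# that share one time scale are concatenated channel-wise, then a single
# temporal pooling produces a fixed-size utterance-level vector.
import torch
import torch.nn as nn

class FrameLevelMSA(nn.Module):
    """Concatenate same-time-scale feature maps, then average over time."""
    def forward(self, feature_maps):
        # feature_maps: list of (batch, channels_i, time) tensors with
        # identical time lengths (the frame-level case from the quote).
        stacked = torch.cat(feature_maps, dim=1)  # (batch, sum(channels_i), time)
        return stacked.mean(dim=2)                # (batch, sum(channels_i))

# Three layers' outputs with matching time axes.
maps = [torch.randn(8, c, 100) for c in (64, 128, 256)]
embedding = FrameLevelMSA()(maps)                 # shape (8, 448)
```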
“…Even with this robustness, using multi-scale features from multiple layers (Fig. 2(b)), called multi-scale aggregation (MSA), has shown better performance than using single-scale feature maps [22], [23], [48], [49]. Note that, between the frame- and segment-level operations, we should choose the segment-level operation for the MSA because all the feature maps from different layers have the same time scale in the frame-level operation.…”
Section: Multi-scale Aggregation
Citation type: mentioning
confidence: 99%
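
The segment-level alternative this quote argues for can be sketched the same way: pool each layer over its own time axis first, then concatenate the resulting fixed-size vectors. Again a hypothetical illustration; the class name and shapes are assumptions, not details from the cited papers.

```python
# Hypothetical segment-level MSA: each layer's map is pooled over its
# own time axis first, so the layers may have different time scales
# (e.g. after strided convolutions); only then are they concatenated.
import torch
import torch.nn as nn

class SegmentLevelMSA(nn.Module):
    """Pool each feature map over time, then concatenate the vectors."""
    def forward(self, feature_maps):
        pooled = [fm.mean(dim=2) for fm in feature_maps]  # each (batch, channels_i)
        return torch.cat(pooled, dim=1)                   # (batch, sum(channels_i))

# Layers with different time resolutions still aggregate cleanly.
maps = [torch.randn(8, 64, 100), torch.randn(8, 128, 50), torch.randn(8, 256, 25)]
embedding = SegmentLevelMSA()(maps)                       # shape (8, 448)
```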
“…2-D CNNs have also shown competitive results for speaker verification. Computer vision architectures such as VGG [10,7,11,9] and ResNet [8,12,13] have been adapted to capture speaker-discriminative information from Mel-spectrograms. In fact, ResNet34 has shown better performance than TDNN in the most recent speaker verification challenges [14,15].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
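
A hedged sketch of the adaptation this quote describes, assuming torchvision and torchaudio are available: a stock ResNet34 with its first convolution replaced to take a single-channel Mel-spectrogram. The embedding size, Mel parameters, and input duration are illustrative choices, not values from the cited systems.

```python
# Stock torchvision ResNet34 adapted to single-channel Mel-spectrogram
# input; the 256-dim output, n_mels=64, and 2 s clip are assumptions.
import torch
import torch.nn as nn
import torchaudio
from torchvision.models import resnet34

model = resnet34(num_classes=256)  # final fc reused as a 256-dim embedding head
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
waveform = torch.randn(1, 16000 * 2)           # 2 s of dummy 16 kHz audio
features = melspec(waveform).unsqueeze(1)      # (batch, 1, n_mels, frames)
embedding = model(features)                    # (1, 256)
```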
“…[5] proposed the statistics pooling layer, which combines mean and standard deviation statistics for weighted aggregation of temporal frames. More recently, [6] proposed time-distributed voting (TDV) for aggregating features extracted by their UtterIdNet front-end in short-segment speaker verification, especially at sub-second durations. [7] proposed the use of dictionary-based NetVLAD or GhostVLAD [8] for aggregating temporal features, using a 34-layer ResNet-based front-end for feature extraction.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
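
Of the aggregation methods listed in this quote, the statistics pooling of [5] is simple enough to sketch directly: mean and standard deviation over the time axis yield a fixed-size vector for any utterance length. Shapes below are illustrative.

```python
# Minimal statistics pooling in the style of [5]: concatenated per-channel
# mean and standard deviation over time, independent of segment duration.
import torch
import torch.nn as nn

class StatisticsPooling(nn.Module):
    def forward(self, x):
        # x: (batch, channels, time) frame-level features
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)  # (batch, 2*channels)

frames = torch.randn(4, 512, 37)          # a short segment: 37 frames
fixed = StatisticsPooling()(frames)       # (4, 1024), duration-independent
```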