Interspeech 2019
DOI: 10.21437/interspeech.2019-2177
Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification

Abstract: In this paper, we propose a new pooling method called spatial pyramid encoding (SPE) to generate speaker embeddings for text-independent speaker verification. We first partition the output feature maps from a deep residual network (ResNet) into increasingly fine sub-regions and extract speaker embeddings from each sub-region through a learnable dictionary encoding layer. These embeddings are concatenated to obtain the final speaker representation. The SPE layer not only generates a fixed-dimensional speaker em…
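The pooling scheme the abstract describes can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's implementation: the real SPE layer learns the codebook and assignment weights end-to-end inside the network, whereas here the learnable dictionary encoding (LDE) step is approximated by soft-assigning residuals to a fixed codebook, and the pyramid levels `(1, 2)` and all shapes are illustrative.

```python
import numpy as np

def lde_encode(feats, codebook, smoothing=1.0):
    # Simplified LDE: soft-assign each frame's residual to K codewords and
    # aggregate. feats: (N, D) positions from one sub-region; codebook: (K, D).
    resid = feats[:, None, :] - codebook[None, :, :]        # (N, K, D)
    dist2 = (resid ** 2).sum(-1)                            # (N, K)
    # subtract per-row min before exp for numerical stability
    w = np.exp(-smoothing * (dist2 - dist2.min(axis=1, keepdims=True)))
    w /= w.sum(axis=1, keepdims=True)                       # soft assignment
    enc = (w[:, :, None] * resid).sum(axis=0)               # (K, D)
    return enc.ravel()                                      # (K*D,)

def spatial_pyramid_encode(fmap, codebook, levels=(1, 2)):
    # fmap: (C, H, W) feature map from the ResNet. Partition it into an
    # s x s grid per pyramid level, encode each sub-region, and concatenate.
    C, H, W = fmap.shape
    embs = []
    for s in levels:
        for i in range(s):
            for j in range(s):
                sub = fmap[:, i*H//s:(i+1)*H//s, j*W//s:(j+1)*W//s]
                feats = sub.reshape(C, -1).T                # (positions, C)
                embs.append(lde_encode(feats, codebook))
    return np.concatenate(embs)  # fixed-dimensional regardless of H, W

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))       # C=8 channels, 4x4 spatial map
codebook = rng.standard_normal((3, 8))      # K=3 codewords of dimension D=C
emb = spatial_pyramid_encode(fmap, codebook)
# levels (1, 2) give 1 + 4 = 5 sub-regions, each K*D = 24 dims -> 120 total
```

Note the key property: because each sub-region is pooled down to a fixed K*D vector, the concatenated embedding has the same dimensionality whatever the input utterance length produced for H and W.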

Cited by 28 publications (28 citation statements) · References 35 publications
“…The TDNN is composed of 1D convolution and fully connected layers. Several later studies [16,19,21,22,25,27] suggest replacing the TDNN with variants of ResNet34, composed of 2D convolutions, as the encoder network. We used TDNN and ResNet34 for x-vector and LDE embeddings, respectively.…”
Section: Encoder Network
confidence: 99%
“…Villalba et al. summarized several state-of-the-art speaker recognition systems for the NIST SRE18 Challenge [16], where x-vector based systems [17] consistently outperformed i-vector based systems [18]. There has also been a surge of interest in new encoding methods and end-to-end loss functions for speaker recognition [19,20,21,22,23,24,25]. One prominent advancement is the use of learnable dictionary encoding (LDE) [19] and angular softmax [20] for speaker recognition, which are reported to boost speaker recognition performance on open-source corpora such as the VoxCelebs [26,27].…”
Section: Introduction
confidence: 99%
“…The first module is the frame-level feature extractor, which takes a sequence of acoustic features x_t and outputs corresponding speaker features h_t (t = 1, …, T). In our system, ResNet [19] is used as the feature extractor, which has been widely used in previous studies [20,21,22]. The architecture is described in Table 1.…”
Section: Attention-Based Soft VAD With the SV System
confidence: 99%
“…The minibatch size is 64, and the weight decay parameter is 0.0001. We use the same learning rate schedule as in [22] with the initial learning rate of 0.1.…”
Section: Experimental Setups for Speaker Verification
confidence: 99%
“…This assumes that voice activity detection (VAD) has not been applied to the audio to remove non-speech frames, which may degrade SV performance. Most SV studies still rely on a traditional energy-based VAD [13], [20], [30], and some do not apply VAD at all [31], [32]. This is because most SV databases fall into the following two cases, minimizing the need for robust VAD: (1) They were recorded in relatively clean conditions, where the naive energy-based VAD performs reasonably well.…”
Section: Introduction
confidence: 99%