Interspeech 2019
DOI: 10.21437/interspeech.2019-2195

Shortcut Connections Based Deep Speaker Embeddings for End-to-End Speaker Verification System

Cited by 22 publications (31 citation statements). References 0 publications.
“…For all the pooling methods mentioned above, we use only a single-scale feature map from the last layer of the feature extractor. Recently, multi-scale aggregation (MSA) methods have been proposed to exploit speaker information at multiple time scales [22], [23], [48], [49], showing their effectiveness in dealing with variable-duration test utterances.…”
Section: Deep Speaker Embedding Learning (mentioning)
confidence: 99%
“…Even with this robustness, using multi-scale features from multiple layers (Fig. 2(b)), called multi-scale aggregation (MSA), has shown better performance than using single-scale feature maps [22], [23], [48], [49]. Note that, between the frame- and segment-level operations, we should choose the segment-level operation for the MSA, because all the feature maps from different layers have the same time scale in the frame-level operation.…”
Section: Multi-scale Aggregation (mentioning)
confidence: 99%
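The statement above describes segment-level MSA: feature maps tapped from several layers are each pooled over time and then combined before the embedding layer. The following is a minimal PyTorch sketch of that idea only; the class name, the use of mean/std statistics pooling, and the channel and embedding sizes are illustrative assumptions, not the configuration of the cited papers.

```python
import torch
import torch.nn as nn


class MultiScaleAggregation(nn.Module):
    """Sketch of segment-level multi-scale aggregation: pool feature maps
    from several tapped layers over time, concatenate the pooled statistics,
    and project to a fixed-size speaker embedding."""

    def __init__(self, channels=(256, 256, 256), embed_dim=512):
        super().__init__()
        # mean + std per tapped layer -> 2 * sum(channels) input features
        self.embed = nn.Linear(2 * sum(channels), embed_dim)

    @staticmethod
    def stats_pool(x):
        # x: (batch, channels, frames) -> (batch, 2 * channels)
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

    def forward(self, feature_maps):
        # feature_maps: list of (batch, C_i, T_i) tensors from different layers.
        # Pooling over time removes the (possibly different) time scales,
        # which is why the aggregation is done at the segment level.
        pooled = [self.stats_pool(f) for f in feature_maps]
        return self.embed(torch.cat(pooled, dim=1))


# toy usage: three tapped layers with different temporal resolutions
msa = MultiScaleAggregation()
maps = [torch.randn(4, 256, t) for t in (300, 150, 75)]
embedding = msa(maps)  # shape (4, 512)
```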
“…To address this problem, several studies have applied a pooling layer or a temporal average layer to an end-to-end system [2,3]. The second is a speaker embedding-based system [4][5][6][7][8][9][10][11][12][13][14], which converts an input of variable length into a vector of fixed length using a DNN. The generated vector is used as an embedding to represent the speaker.…”
Section: Introduction (mentioning)
confidence: 99%
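The quoted passage refers to turning a variable-length utterance into a fixed-length embedding via a temporal pooling layer. Below is a minimal sketch of that mechanism, assuming a small Conv1d frame-level front end and average pooling over time; the layer sizes and feature dimension are hypothetical.

```python
import torch
import torch.nn as nn


class AveragePoolEmbedder(nn.Module):
    """Toy embedding extractor: frame-level layers followed by temporal
    average pooling, so utterances of any length map to a fixed-size vector."""

    def __init__(self, feat_dim=40, embed_dim=256):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.embed = nn.Linear(512, embed_dim)

    def forward(self, feats):
        # feats: (batch, feat_dim, frames); the number of frames may vary
        h = self.frame_layers(feats)
        h = h.mean(dim=2)        # temporal average pooling -> (batch, 512)
        return self.embed(h)     # fixed-length speaker embedding


emb = AveragePoolEmbedder()
short_utt = emb(torch.randn(1, 40, 200))    # short utterance
long_utt = emb(torch.randn(1, 40, 1000))    # long utterance
assert short_utt.shape == long_utt.shape    # both (1, 256)
```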
“…In addition, back-end methods, for example probabilistic linear discriminant analysis (PLDA), can be used [8][9][10]. The most important part of the above system is the speaker embedding generation [13]. A speaker embedding is a high-dimensional feature vector that contains speaker information.…”
(mentioning)
confidence: 99%
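The back-end named in the quote is PLDA; as a simpler stand-in that is also common in speaker verification, the sketch below scores a trial with cosine similarity between an enrollment embedding and a test embedding. The threshold value is purely illustrative.

```python
import torch
import torch.nn.functional as F


def cosine_score(enroll_emb, test_emb):
    """Cosine-similarity back-end: a simpler alternative to PLDA for
    comparing two speaker embeddings. Higher score means the two
    utterances are more likely from the same speaker."""
    return F.cosine_similarity(enroll_emb, test_emb, dim=-1)


# toy trial: random vectors standing in for real extractor output
enroll = torch.randn(1, 256)
test = torch.randn(1, 256)
score = cosine_score(enroll, test)   # value in [-1, 1]
accept = score.item() > 0.5          # threshold tuned on development data
```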