Interspeech 2020
DOI: 10.21437/interspeech.2020-2868

Open-Set Short Utterance Forensic Speaker Verification Using Teacher-Student Network with Explicit Inductive Bias

Abstract: In forensic applications, it is very common that only small naturalistic datasets consisting of short utterances in complex or unknown acoustic environments are available. In this study, we propose a pipeline solution to improve speaker verification on a small actual forensic field dataset. By leveraging large-scale out-of-domain datasets, a knowledge distillation based objective function is proposed for teacher-student learning, which is applied for short utterance forensic speaker verification. The objective…
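The abstract describes a knowledge-distillation objective for teacher-student learning of speaker embeddings. As a rough illustration only (the paper's exact loss is not shown here), a common form combines an embedding-matching term with a temperature-softened KL term; all weights and names below are hypothetical:

```python
import numpy as np

def kd_embedding_loss(student_emb, teacher_emb, student_logits,
                      teacher_logits, temperature=2.0, alpha=0.5):
    """Illustrative knowledge-distillation objective (not the paper's
    exact formulation): a weighted sum of
    (1) MSE between student and teacher speaker embeddings, and
    (2) KL divergence between temperature-softened speaker posteriors."""
    # Embedding-level distillation: pull the student toward the
    # teacher's embedding space.
    mse = np.mean((student_emb - teacher_emb) ** 2)

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    # Label-level distillation: soften both logit vectors with a
    # temperature before comparing the distributions.
    p_t = softmax(teacher_logits / temperature)
    p_s = softmax(student_logits / temperature)
    kl = float(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))))

    return alpha * mse + (1.0 - alpha) * kl
```

When student and teacher agree exactly, both terms vanish and the loss is zero; the `alpha` trade-off between the two terms is a tunable assumption.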

Cited by 14 publications (6 citation statements)
References 25 publications
“…In this section, we compare the proposed lightweight method to six state-of-the-art methods for lightweight SV, including the ECAPA-TDNNLite [50], EfficientTDNN [51], KD-based [52], Thin-ResNet34 [64], Fast-ResNet34 [65], and CSTCTS1dConv (Channel Split Time-Channel-Time Separable 1-dimensional Convolution) [66]. The ECAPA-TDNNLite based method [50] is a lightweight version of the ECAPA-TDNN based method, in which a large model, ECAPA-TDNN, is utilized for enrollment and a small model, ECAPA-TDNNLite, is used for verification.…”
Section: Comparison Of Different Methods
confidence: 99%
“…Additionally, some researchers applied the techniques of Knowledge Distillation (KD) [46], [47] and Neural Architecture Search (NAS) [48] to implement lightweight SV [49]- [52]. In the work of [49], the strategy of teacher-student training was proposed for text-independent SV, and competitive error rate with 88-93% smaller models was obtained.…”
Section: Related Work
confidence: 99%
“…They trained the model to decrease the distances between speaker embeddings extracted from the same speaker utterances and increase the distances between speaker embeddings of different speakers. In addition to these loss functions, the prior studies have investigated diverse methods, such as data augmentation [17,18], network architectures [19,20], and system frameworks [21,22].…”
Section: Related Work
confidence: 99%
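The statement above describes training embeddings so that same-speaker distances shrink and different-speaker distances grow; at verification time such embeddings are typically compared by cosine scoring. A minimal scoring sketch (the threshold value is a hypothetical placeholder, not from the paper):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enroll_emb, test_emb, threshold=0.5):
    """Accept the trial as same-speaker if similarity exceeds
    the decision threshold (placeholder value for illustration)."""
    return cosine_similarity(enroll_emb, test_emb) >= threshold
```

In practice the threshold is calibrated on a development set to trade off false acceptances against false rejections.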
“…For input-level, models usually can be adapted by training with enhanced [8] or domain-translated [9] input features. Adaptation at the embedding level often targets minimizing certain distances between source and target domains to align them in the same embedding space, such as cosine distance [10], mean squared error (MSE) [11], and maximum mean discrepancy (MMD) [12]. However, this method usually requires parallel or artificially simulated data, which cannot generalize well to real-world scenarios.…”
Section: Introduction
confidence: 99%
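Of the three alignment distances quoted above, MMD is the least self-explanatory: it compares two sets of embeddings through a kernel rather than pointwise, so it needs no parallel pairing between source and target utterances. A minimal sketch with an RBF kernel (the bandwidth `gamma` is an illustrative assumption):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.1):
    """RBF kernel matrix between rows of x and rows of y."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(source, target, gamma=0.1):
    """Squared maximum mean discrepancy between two embedding sets:
    k(s,s) + k(t,t) - 2 k(s,t), averaged over all pairs."""
    k_ss = rbf_kernel(source, source, gamma)
    k_tt = rbf_kernel(target, target, gamma)
    k_st = rbf_kernel(source, target, gamma)
    return float(k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean())
```

MMD is zero when both sets come from the same distribution and grows as the two embedding clouds separate, which is why it serves as an alignment loss for unpaired domain adaptation.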