Text-independent speaker recognition is widely used in identity recognition. To improve feature recognition ability, this paper proposed a text-independent speaker recognition method based on a deep residual network model. First, 64-dimensional log filter bank features were extracted from the original audio. Second, a deep residual network was used to extract abstract features from the log filter bank features. The deep residual network was composed of a residual network and a Convolutional Attention Statistics Pooling (CASP) layer, which aggregated the frame-level features from the residual network into utterance-level features. Finally, an Adaptive Curriculum Learning Loss (ACLL) classifier was used to optimize the abstract features output by the deep residual network and to complete the text-independent speaker recognition. The proposed method was evaluated in extensive text-independent speaker recognition experiments using the large VoxCeleb2 dataset, and it achieved an average equal error rate (EER) of 1.76% on the VoxCeleb1 test set, 1.91% on the VoxCeleb1-E test set, and 3.24% on the VoxCeleb1-H test set. Compared with related speaker recognition methods, the EER was reduced by 1.11% on the VoxCeleb1 test set, 1.04% on the VoxCeleb1-E test set, and 1.69% on the VoxCeleb1-H test set.
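The results above are reported as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER could be computed from verification trial scores. It is not the paper's evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores are illustrative assumptions.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from trial scores (illustrative sketch).

    scores: similarity score per trial; higher means "same speaker".
    labels: truthy for target (same-speaker) trials, falsy for impostor trials.
    Assumption: thresholds are swept over the observed scores only, and the
    EER is taken where |FAR - FRR| is smallest.
    """
    target = [s for s, l in zip(scores, labels) if l]
    impostor = [s for s, l in zip(scores, labels) if not l]
    if not target or not impostor:
        raise ValueError("need at least one target and one impostor trial")

    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        # False acceptance rate: impostor trials accepted at this threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        # False rejection rate: target trials rejected at this threshold.
        frr = sum(s < threshold for s in target) / len(target)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy example (made-up scores): one target trial scores below one impostor trial.
print(equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.5
```

With only a handful of trials the estimate is coarse; on a full trial list such as those for the VoxCeleb1 test sets, the sweep has many more thresholds and the point where the two error rates cross is resolved more finely.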