We propose a novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for automatic speech recognition (ASR). Specifically, the DNN replaces the standard Gaussian mixture model (GMM) to produce frame alignments. The use of an ASR-DNN system in the speaker recognition pipeline is attractive as it integrates the information from speech content directly into the statistics, allowing the standard backends to remain unchanged. Improvement from the proposed framework compared to a state-of-the-art system are of 30% relative at the equal error rate when evaluated on the telephone conditions from the 2012 NIST speaker recognition evaluation (SRE). The proposed framework is a successful way to efficiently leverage transcribed data for speaker recognition, thus opening up a wide spectrum of research directions.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.