Koji Okabe scite author profile

This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an utterance-level feature. Our method utilizes an attention mechanism to give different weights to different frames and generates not only weighted means but also weighted standard deviations. In this way, it can capture long-term variations in speaker characteristics more effectively. An evaluation on the NIST SRE 2012 and the VoxCeleb data sets shows that it reduces equal error rates (EERs) from the conventional method by 7.5% and 8.1%, respectively.
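The pooling step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the per-frame attention scores have already been produced by some scoring network (not shown here), and the names `softmax` and `attentive_stats_pool` are hypothetical. The scores are normalized with a softmax, and the resulting weights are used to compute a weighted mean and a weighted standard deviation per feature dimension.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pool(frames, scores):
    """Pool T frame-level vectors into one utterance-level vector.

    frames: list of T feature vectors, each a list of D floats.
    scores: list of T raw attention scores (assumed to be given).
    Returns a list of 2*D floats: weighted means, then weighted stds.
    """
    weights = softmax(scores)
    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    std = []
    for d in range(dim):
        # Weighted second moment minus squared mean, clamped at zero
        # to guard against small negative values from rounding.
        second = sum(w * f[d] ** 2 for w, f in zip(weights, frames))
        std.append(math.sqrt(max(second - mean[d] ** 2, 0.0)))
    return mean + std


frames = [[1.0, 0.0], [3.0, 0.0], [5.0, 4.0]]
uniform = attentive_stats_pool(frames, [0.0, 0.0, 0.0])   # equal weights
focused = attentive_stats_pool(frames, [0.0, 0.0, 10.0])  # last frame dominates
```

With equal scores the result reduces to the ordinary mean and standard deviation over frames. With one dominant score, the pooled vector moves toward that frame and the standard deviation shrinks.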

Attention Mechanism in Speaker Recognition: What Does it Learn in Deep Speaker Embedding?

Wang, Okabe, Lee, et al. 2018

This paper presents an experimental study on deep speaker embedding with an attention mechanism that has been found to be a powerful representation learning technique in speaker recognition. In this framework, an attention model works as a frame selector that computes an attention weight for each frame-level feature vector, in accord with which an utterance-level representation is produced at the pooling layer in a speaker embedding network. In general, an attention model is trained together with the speaker embedding network on a single objective function, and thus those two components are tightly bound to one another. In this paper, we consider the possibility that the attention model might be decoupled from its parent network and assist other speaker embedding networks and even conventional i-vector extractors. This possibility is demonstrated through a series of experiments on a NIST Speaker Recognition Evaluation (SRE) task, with 9.0% EER reduction and 3.8% min C_primary reduction when the attention weights are applied to i-vector extraction. Another experiment shows that DNN-based soft voice activity detection (VAD) can be effectively combined with the attention mechanism to yield further reduction of min C_primary by 6.6% and 1.6% in deep speaker embedding and i-vector systems, respectively.

Speaker Augmentation and Bandwidth Extension for Deep Speaker Embedding

Yamamoto, Lee, Okabe, et al. 2019

NEC-TT System for Mixed-Bandwidth and Multi-Domain Speaker Recognition

Lee, Yamamoto, Okabe, et al. 2020, Computer Speech & Language

Speaker Detection in the Wild: Lessons Learned from JSALT 2019

García, Villalba, Bredin, et al. 2020

Attentive Statistics Pooling for Deep Speaker Embedding
