Pushing the limits of raw waveform speaker recognition

Jung, Jee-weon; Kim, You Jin; Heo, Hee-Soo; Lee, Bong-Jin; Kwon, Youngki; Chung, Joon Son

doi:10.48550/arxiv.2203.08488

Cited by 2 publications

(1 citation statement)

References 26 publications


“…In the embedding extraction step, audio with variable duration is converted into a single fixed-dimensional vector representation called speaker embedding, which is assumed to contain speaker-relevant information. With a sophisticated speaker embedding, even a simple scoring method such as cosine similarity or Euclidean distance has shown high speaker verification performance [2]-[4]. Therefore, most studies have been focused on how to extract a fine speaker embedding from input speech.…”

Section: Introduction (mentioning)

confidence: 99%
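
The simple scoring methods named in the statement above can be sketched as follows. This is a minimal illustration, not code from either paper: the function names, the toy embeddings, and the decision threshold of 0.5 are all assumptions chosen for the example, and a real system would tune the threshold on held-out trials.

```python
import numpy as np


def cosine_score(enroll, test):
    """Cosine similarity between two fixed-dimensional speaker embeddings."""
    enroll = np.asarray(enroll, dtype=float)
    test = np.asarray(test, dtype=float)
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))


def euclidean_distance(enroll, test):
    """Euclidean distance between two embeddings (smaller means more similar)."""
    diff = np.asarray(enroll, dtype=float) - np.asarray(test, dtype=float)
    return float(np.linalg.norm(diff))


def verify(enroll, test, threshold=0.5):
    """Accept the trial if the cosine score reaches the (hypothetical) threshold."""
    return cosine_score(enroll, test) >= threshold
```

Both scores operate only on the extracted embeddings, which is why the statement attributes most of the verification performance to the quality of the embedding rather than to the scoring rule.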

Disentangled Speaker Representation Learning via Mutual Information Minimization

Mun, Han, Kim, et al. 2022

Preprint


The domain mismatch problem caused by speaker-unrelated features has been a major topic in speaker recognition. In this paper, we propose an explicit disentanglement framework that separates speaker-relevant features from speaker-unrelated features via mutual information (MI) minimization. To minimize the MI between speaker-related and speaker-unrelated features, we adopt the contrastive log-ratio upper bound (CLUB), which provides an upper bound on MI. Our framework has a three-stage structure. First, in the front-end encoder, input speech is encoded into a shared initial embedding. Next, in the decoupling block, the shared initial embedding is split into separate speaker-related and speaker-unrelated embeddings. Finally, disentanglement is performed by MI minimization in the last stage. Experiments on the Far-Field Speaker Verification Challenge 2022 (FFSVC2022) demonstrate that the proposed framework is effective for disentanglement. In addition, to utilize domain-unknown datasets containing numerous speakers, we pre-trained the front-end encoder on the VoxCeleb datasets. We then fine-tuned the speaker embedding model within the disentanglement framework on the FFSVC2022 dataset. The experimental results show that fine-tuning an existing pre-trained model with the disentanglement framework is valid and can further improve performance.
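
The CLUB estimate mentioned in the abstract can be illustrated with a small sampled estimator. This is a sketch under stated assumptions, not the authors' implementation: it assumes the conditional distribution q(y | x) is a diagonal Gaussian whose per-sample mean and variance are already given as arrays (in practice a learned network would typically produce them), and the names `gaussian_log_density` and `club_upper_bound` are hypothetical.

```python
import numpy as np


def gaussian_log_density(y, mean, var):
    """Log-density of a diagonal Gaussian, summed over the feature axis."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var, axis=-1)


def club_upper_bound(mean, var, y):
    """Sampled CLUB estimate for a batch of N paired samples (x_i, y_i).

    mean, var: arrays of shape (N, D), parameters of the assumed q(y | x_i).
    y:         array of shape (N, D), e.g. speaker-unrelated embeddings.
    """
    # Paired term: log q(y_i | x_i) for each i, shape (N,).
    positive = gaussian_log_density(y, mean, var)
    # Unpaired term: log q(y_j | x_i) for every (i, j), shape (N, N).
    negative = gaussian_log_density(y[None, :, :], mean[:, None, :], var[:, None, :])
    # Difference of the two batch averages.
    return float(np.mean(positive) - np.mean(negative))
```

When the Gaussian parameters do not depend on x, the paired and unpaired terms coincide and the estimate is zero; when y is well predicted from x, the paired term dominates and the estimate grows. Minimizing this quantity with respect to the embeddings is what pushes the two representations toward independence.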

