Pushing the limits of raw waveform speaker recognition

Jung, Jee-weon; Kim, Youjin; Heo, Hee-Soo; Lee, Bong-Jin; Kwon, Youngki; Chung, Joon Son

doi:10.21437/interspeech.2022-126

Cited by 37 publications

(5 citation statements)

References 0 publications


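“…In this study, we use RawNet3, which offers strong speaker identification performance among models that take the raw waveform as input, as the speaker encoder, and implement a 'one-shot multi-speaker speech synthesis' model that excels at reproducing similar speaker timbre (Jung et al., 2022). RawNet3, shown in Figure 2(b), has a structure based on RawNet2 and ECAPA-TDNN. It uses the filterbank structure of the RawNet2 model, but extends it from real-valued to complex-valued (Desplanques et al., 2020; Jung et al., 2020)…”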

Section: Speaker Encoder (unclassified)

One-shot multi-speaker text-to-speech using RawNet3 speaker representation*

Han, Um, Kim

2024

Phonetics Speech Sci.


Recent advances in text-to-speech (TTS) technology have significantly improved the quality of synthesized speech, reaching a level where it can closely imitate natural human speech. In particular, TTS models that offer varied voice characteristics and personalized speech are widely used in fields such as artificial intelligence (AI) tutors, advertising, and video dubbing. Accordingly, in this paper, we propose a one-shot multi-speaker TTS system that can ensure acoustic diversity and synthesize personalized voices by generating speech from unseen target speakers' utterances. The proposed model integrates a speaker encoder into a TTS model consisting of the FastSpeech2 acoustic model and the HiFi-GAN vocoder. The speaker encoder, based on the pre-trained RawNet3, extracts speaker-specific voice features. Furthermore, the proposed approach not only includes an English one-shot multi-speaker TTS but also introduces a Korean one-shot multi-speaker TTS. We evaluate the naturalness and speaker similarity of the generated speech using objective and subjective metrics. In the subjective evaluation, the proposed Korean one-shot multi-speaker TTS obtained a naturalness mean opinion score (NMOS) of 3.36 and a similarity MOS (SMOS) of 3.16. The objective evaluation of the proposed English and Korean one-shot multi-speaker TTS showed a prediction MOS (P-MOS) of 2.54 and 3.74, respectively. These results indicate that the proposed model improves on the baseline models in terms of both naturalness and speaker similarity.


“…In particular, all experiments reported in Table 1 rely on the implementation of ECAPA-TDNN [22] available in SpeechBrain [23] because it was found to outperform three open-source alternatives. For instance, on VoxConverse v0.3, the fine-tuned pipeline reaches DER = 14.9% with the xvector implementation available in pyannote.audio [20], 12.0% with NeMo's TitaNet [24], 10.8% with RawNet3 [25], and 10.7% with SpeechBrain's ECAPA-TDNN.…”

Section: Reproducible Benchmark (mentioning)

confidence: 99%

pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe

Bredin

2023

Interspeech 2023


pyannote.audio is an open-source toolkit written in Python for speaker diarization. Version 2.1 introduces a major overhaul of the default pyannote.audio speaker diarization pipeline, made of three main stages: speaker segmentation applied to a short sliding window, neural speaker embedding of each (local) speaker, and (global) agglomerative clustering. One of the main objectives of the toolkit is to democratize speaker diarization. Therefore, on top of a pretrained speaker diarization pipeline that gives good results out of the box, we also provide a recipe that practitioners can follow to improve its performance on their own (manually annotated) dataset. It has been used for various challenges and reached 1st place at Ego4D 2022, 1st place at Albayzin 2022, and 6th place at VoxSRC 2022.
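
The last stage named in the abstract above, agglomerative clustering of speaker embeddings, can be illustrated with a minimal pure-Python sketch. This is not the pyannote.audio implementation: the function names, the choice of cosine distance with average linkage, the stopping threshold, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglomerative_cluster(embeddings, threshold):
    """Average-linkage clustering (hypothetical helper, not a library API).

    Repeatedly merges the two closest clusters and stops once the
    closest pair is farther apart than `threshold`.
    Returns one integer cluster label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a, b):
        # Average pairwise distance between members of clusters a and b.
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in a for j in b]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy embeddings: two pairs pointing in clearly different directions.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = agglomerative_cluster(toy_embeddings, threshold=0.5)
```

In this toy case the first two and last two vectors end up in separate clusters. In a real pipeline the inputs would be embeddings from a speaker encoder such as those compared in the citing paper, and the threshold would be tuned on annotated data.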


“…Other comparison-based self-supervised learning techniques include the MOCO framework [38], [39], which stores the negative pairs in the memory bank; the DINO framework [12], [40]-[42] that only involves positive pairs and achieves considerable improvement. For efficiency and effectiveness, we adopt the SCL framework in this study and focus on the sampling strategy of positive pairs.…”

Section: B Self-supervised Learning Of Speaker Encoder (mentioning)

confidence: 99%

Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs

Tao, Lee, Das, et al.

2023

IEEE/ACM Trans. Audio Speech Lang. Process.


We study a novel neural speaker encoder and its training strategies for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-dimensional speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique. However, the contrastive learning of the speaker encoder depends very much on the sampling strategy of positive and negative pairs. It is common to sample a positive pair of segments from the same utterance. Unfortunately, such a strategy, denoted as poor-man's positive pairs (PPP), lacks the necessary diversity. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we find diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve an equal error rate (EER) of 2.89%, 3.17% and 6.27% under the proposed progressive clustering strategy, and an EER of 1.44%, 1.77% and 3.27% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms the state-of-the-art self-supervised learning methods by a large margin while achieving results comparable to those of its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where speaker information is unavailable. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.
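
The abstract above reports results as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how EER could be approximated from verification trial scores follows. It is not the authors' evaluation code: the function name and the toy scores are hypothetical, and the sweep over observed scores without interpolation is a simplifying assumption.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by using every observed score as a threshold.

    target_scores: similarity scores of same-speaker trials.
    nontarget_scores: similarity scores of different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores (made up): one target trial overlaps the non-target range.
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

With these toy scores, one of four target trials is rejected and one of four non-target trials is accepted at the balancing threshold, so the sketch yields an EER of 25%. Evaluation toolkits typically interpolate between thresholds, which this sketch omits.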
