2022
DOI: 10.48550/arxiv.2203.08488
Preprint

Pushing the limits of raw waveform speaker recognition

Abstract: In recent years, speaker recognition systems based on raw waveform inputs have received increasing attention. However, the performance of such systems is typically inferior to that of state-of-the-art counterparts based on handcrafted features, which demonstrate equal error rates under 1% on the popular VoxCeleb1 test set. This paper proposes a novel speaker recognition model based on raw waveform inputs. The model incorporates recent advances in machine learning and speaker verification, including the Res2Net backbo…

citations
Cited by 2 publications
(1 citation statement)
references
References 26 publications
“…In the embedding extraction step, audio with variable duration is converted into a single fixed-dimensional vector representation called speaker embedding, which is assumed to contain speaker-relevant information. With a sophisticated speaker embedding, even a simple scoring method such as cosine similarity or Euclidean distance has shown high speaker verification performance [2]-[4]. Therefore, most studies have been focused on how to extract a fine speaker embedding from input speech.…”
Section: Introduction (mentioning)
confidence: 99%
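
The scoring step mentioned in the citation statement can be illustrated with a minimal sketch. It assumes two fixed-dimensional speaker embeddings are already available as plain lists of floats; the function names and the 0.5 decision threshold are hypothetical and are not taken from the cited work or from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two embeddings (0.0 means identical)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def same_speaker(enrolled: Sequence[float], test: Sequence[float],
                 threshold: float = 0.5) -> bool:
    """Accept the trial if the cosine score reaches the threshold.

    The threshold is a hypothetical placeholder; in practice it would be
    tuned on held-out trials, for example at the equal error rate point.
    """
    return cosine_similarity(enrolled, test) >= threshold
```

In this sketch a higher cosine score means the two utterances are more likely to come from the same speaker, whereas for Euclidean distance a lower value does.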