2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9004029

Short Utterance Compensation in Speaker Verification via Cosine-Based Teacher-Student Learning of Speaker Embeddings

Abstract: The short duration of an input utterance is one of the most critical threats that degrade the performance of speaker verification systems. This study aimed to develop an integrated text-independent speaker verification system that inputs utterances with a short duration of 2 seconds or less. We propose an approach using a teacher-student learning framework for this goal, applied to short utterance compensation for the first time to our knowledge. The core concept of the proposed system is to conduct the compensa…
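The abstract is truncated above, but its core idea can be illustrated: a student network fed short utterances is trained so that its embedding aligns, under cosine similarity, with the teacher's embedding of the corresponding full-length utterance. The following is a minimal PyTorch sketch under that assumption; teacher_net, student_net, and the training-step wrapper are hypothetical names, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def cosine_ts_loss(teacher_emb, student_emb):
    """Cosine-based teacher-student loss: 1 - cos(teacher, student),
    averaged over the batch. Embeddings are shaped (batch, dim)."""
    return (1.0 - F.cosine_similarity(teacher_emb, student_emb, dim=-1)).mean()

def ts_training_step(teacher_net, student_net, long_utts, short_utts, optimizer):
    """One hypothetical training step (names and setup are assumptions).
    - teacher_net: pre-trained embedding extractor, frozen, fed long utterances.
    - student_net: embedding extractor being trained, fed short (<= 2 s) crops."""
    with torch.no_grad():
        teacher_emb = teacher_net(long_utts)   # target embeddings from long input
    student_emb = student_net(short_utts)      # embeddings to be compensated
    loss = cosine_ts_loss(teacher_emb, student_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```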

Cited by 37 publications (27 citation statements)
References 25 publications
“…The figure shows that the embeddings extracted from long segments have higher discriminative power. This superiority is also interpreted as that of short utterance compensation in the field of speaker verification [3,15].…”
Section: Concatenating Multiple Inputs
Mentioning confidence: 99%
“…With the inspiration from the previous researches [2,3], we explored two techniques to modify the TS learning to better conduct the ASC task. The first is extraction of soft-labels from multiple input segments.…”
Section: Introduction
Mentioning confidence: 99%
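The statement above mentions extracting soft labels from multiple input segments when adapting teacher-student learning to acoustic scene classification. One plausible reading, sketched below as an assumption rather than taken from the cited paper, averages the teacher's class posteriors over several segments of the same recording and uses that average as the student's soft target; teacher, student, and segments are hypothetical names.

```python
import torch
import torch.nn.functional as F

def soft_labels_from_segments(teacher, segments):
    """Average the teacher's class posteriors over multiple segments of the
    same recording to form one soft label. segments: (n_seg, ...) tensor."""
    with torch.no_grad():
        posteriors = F.softmax(teacher(segments), dim=-1)  # (n_seg, n_classes)
    return posteriors.mean(dim=0)                          # (n_classes,)

def ts_soft_label_loss(student, short_segment, soft_label):
    """KL divergence between the averaged teacher posterior and the
    student's prediction on a single (shorter) segment."""
    log_p_student = F.log_softmax(student(short_segment.unsqueeze(0)), dim=-1)
    return F.kl_div(log_p_student, soft_label.unsqueeze(0), reduction="batchmean")
```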
“…The DNN used in this study comprises convolutional neural networks (CNNs), gated recurrent units (GRUs) and fully connected layers (CNN-GRU) as used in [15][16][17]. In this architecture, input features are first processed using convolutional layers to extract frame-level embeddings.…”
Section: End-to-end DNN
Mentioning confidence: 99%
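As a rough illustration of the CNN-GRU structure described above (convolutional layers extract frame-level embeddings, a recurrent layer aggregates them over time, and fully connected layers produce the final embedding), a minimal PyTorch sketch follows. Layer sizes and the class name CNNGRUEmbedder are assumptions, not the exact configuration used in [15-17].

```python
import torch
import torch.nn as nn

class CNNGRUEmbedder(nn.Module):
    """Minimal CNN-GRU speaker embedding extractor: convolutions produce
    frame-level features, a GRU aggregates them over time, and fully
    connected layers map the last hidden state to an utterance embedding."""
    def __init__(self, n_mels=40, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, feats):                      # feats: (batch, n_mels, time)
        frames = self.conv(feats)                  # (batch, 128, time) frame-level embeddings
        _, h_n = self.gru(frames.transpose(1, 2))  # aggregate frames over time
        return self.fc(h_n[-1])                    # (batch, emb_dim) utterance embedding
```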
“…A slightly modified ResNet was used for modeling the spectrograms, accounting for different stride sizes for time and frequency domains due to high-resolution in the frequency domain, and the number of residual blocks was adjusted to fit the provided ASV2019 physical access dataset. The raw waveform CNN-GRU model, proposed in [17], was used with a few modifications: one less residual block, a different specified input utterance length at training phase to fit the dataset, and additional loss functions for training (center loss [25] and speaker basis loss [26]). This model first extracts 128-dimensional frame-level representations using 1-dimensional convolutional layers.…”
Section: DNN Architecture
Mentioning confidence: 99%
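The last sentence above, extracting 128-dimensional frame-level representations from the raw waveform with 1-dimensional convolutional layers, can be sketched as follows. Kernel sizes, strides, and the class name RawWaveFrontEnd are illustrative assumptions rather than the cited model's settings.

```python
import torch
import torch.nn as nn

class RawWaveFrontEnd(nn.Module):
    """Hypothetical raw-waveform front end: strided 1-D convolutions map
    (batch, 1, samples) to 128-dimensional frame-level representations."""
    def __init__(self, frame_dim=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=251, stride=80),   # waveform -> coarse frames
            nn.BatchNorm1d(64), nn.LeakyReLU(),
            nn.Conv1d(64, frame_dim, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm1d(frame_dim), nn.LeakyReLU(),
        )

    def forward(self, wav):          # wav: (batch, 1, samples)
        return self.layers(wav)      # (batch, 128, frames) frame-level representations
```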