2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9004029

Short Utterance Compensation in Speaker Verification via Cosine-Based Teacher-Student Learning of Speaker Embeddings

Abstract: The short duration of an input utterance is one of the most critical threats that degrade the performance of speaker verification systems. This study aimed to develop an integrated text-independent speaker verification system that inputs utterances with a short duration of 2 seconds or less. We propose an approach using a teacher-student learning framework for this goal, applied to short utterance compensation for the first time to our knowledge. The core concept of the proposed system is to conduct the compensa…
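The abstract is truncated above, but its core idea can be illustrated: a student network fed short utterances is trained so that its embedding aligns, under cosine similarity, with the teacher's embedding of the corresponding full-length utterance. The following is a minimal PyTorch sketch under that assumption; teacher_net, student_net, and the training-step wrapper are hypothetical names, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def cosine_ts_loss(teacher_emb, student_emb):
    """Cosine-based teacher-student loss: 1 - cos(teacher, student),
    averaged over the batch. Embeddings are shaped (batch, dim)."""
    return (1.0 - F.cosine_similarity(teacher_emb, student_emb, dim=-1)).mean()

def ts_training_step(teacher_net, student_net, long_utts, short_utts, optimizer):
    """One hypothetical training step (names and setup are assumptions).
    - teacher_net: pre-trained embedding extractor, frozen, fed long utterances.
    - student_net: embedding extractor being trained, fed short (<= 2 s) crops."""
    with torch.no_grad():
        teacher_emb = teacher_net(long_utts)   # target embeddings from long input
    student_emb = student_net(short_utts)      # embeddings to be compensated
    loss = cosine_ts_loss(teacher_emb, student_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```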

Cited by 37 publications (27 citation statements)
References 25 publications
“…The figure shows that the embeddings extracted from long segments have higher discriminative power. This superiority is also interpreted as that of short utterance compensation in the field of speaker verification [3,15].…”
Section: Concatenating Multiple Inputs
Mentioning confidence: 99%
“…With the inspiration from the previous researches [2,3], we explored two techniques to modify the TS learning to better conduct the ASC task. The first is extraction of soft-labels from multiple input segments.…”
Section: Introduction
Mentioning confidence: 99%
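The statement above mentions extracting soft labels from multiple input segments when adapting teacher-student learning to acoustic scene classification. One plausible reading, sketched below as an assumption rather than taken from the cited paper, averages the teacher's class posteriors over several segments of the same recording and uses that average as the student's soft target; teacher, student, and segments are hypothetical names.

```python
import torch
import torch.nn.functional as F

def soft_labels_from_segments(teacher, segments):
    """Average the teacher's class posteriors over multiple segments of the
    same recording to form one soft label. segments: (n_seg, ...) tensor."""
    with torch.no_grad():
        posteriors = F.softmax(teacher(segments), dim=-1)  # (n_seg, n_classes)
    return posteriors.mean(dim=0)                          # (n_classes,)

def ts_soft_label_loss(student, short_segment, soft_label):
    """KL divergence between the averaged teacher posterior and the
    student's prediction on a single (shorter) segment."""
    log_p_student = F.log_softmax(student(short_segment.unsqueeze(0)), dim=-1)
    return F.kl_div(log_p_student, soft_label.unsqueeze(0), reduction="batchmean")
```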
“…The DNN used in this study comprises convolutional neural networks (CNNs), gated recurrent units (GRUs) and fully connected layers (CNN-GRU) as used in [15][16][17]. In this architecture, input features are first processed using convolutional layers to extract frame-level embeddings.…”
Section: End-to-end DNN
Mentioning confidence: 99%
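As a rough illustration of the CNN-GRU structure described above (convolutional layers extract frame-level embeddings, a recurrent layer aggregates them over time, and fully connected layers produce the final embedding), a minimal PyTorch sketch follows. Layer sizes and the class name CNNGRUEmbedder are assumptions, not the exact configuration used in [15-17].

```python
import torch
import torch.nn as nn

class CNNGRUEmbedder(nn.Module):
    """Minimal CNN-GRU speaker embedding extractor: convolutions produce
    frame-level features, a GRU aggregates them over time, and fully
    connected layers map the last hidden state to an utterance embedding."""
    def __init__(self, n_mels=40, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, feats):                      # feats: (batch, n_mels, time)
        frames = self.conv(feats)                  # (batch, 128, time) frame-level embeddings
        _, h_n = self.gru(frames.transpose(1, 2))  # aggregate frames over time
        return self.fc(h_n[-1])                    # (batch, emb_dim) utterance embedding
```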
“…A slightly modified ResNet was used for modeling the spectrograms, accounting for different stride sizes for time and frequency domains due to high-resolution in the frequency domain, and the number of residual blocks was adjusted to fit the provided ASV2019 physical access dataset. The raw waveform CNN-GRU model, proposed in [17], was used with a few modifications: one less residual block, a different specified input utterance length at training phase to fit the dataset, and additional loss functions for training (center loss [25] and speaker basis loss [26]). This model first extracts 128-dimensional frame-level representations using 1-dimensional convolutional layers.…”
Section: DNN Architecture
Mentioning confidence: 99%
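The last sentence above, extracting 128-dimensional frame-level representations from the raw waveform with 1-dimensional convolutional layers, can be sketched as follows. Kernel sizes, strides, and the class name RawWaveFrontEnd are illustrative assumptions rather than the cited model's settings.

```python
import torch
import torch.nn as nn

class RawWaveFrontEnd(nn.Module):
    """Hypothetical raw-waveform front end: strided 1-D convolutions map
    (batch, 1, samples) to 128-dimensional frame-level representations."""
    def __init__(self, frame_dim=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=251, stride=80),   # waveform -> coarse frames
            nn.BatchNorm1d(64), nn.LeakyReLU(),
            nn.Conv1d(64, frame_dim, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm1d(frame_dim), nn.LeakyReLU(),
        )

    def forward(self, wav):          # wav: (batch, 1, samples)
        return self.layers(wav)      # (batch, 128, frames) frame-level representations
```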