End-to-end text-dependent speaker verification

Heigold, Georg; Moreno, Ignacio López; Bengio, Samy; Shazeer, Noam

doi:10.1109/icassp.2016.7472652

Cited by 495 publications

(431 citation statements)

References 20 publications

Supporting

Mentioning: 413

Contrasting


“…The proposed method has a close connection to the softmax classifier on the class-center learning method. The objective function (5) aims to maximize the pAUC of the pairwise training set $T_{t_1}$ at a mini-batch iteration, while the cross-entropy minimization with softmax aims to classify the $t_1$ utterances that are used to construct $T_{t_1}$. The class centers $\{w_u\}_{u=1}^{U}$ are used for constructing $T_{t_1}$ in the pAUC optimization, and used as the parameters of the softmax classifier in (7).…”

Section: Connection To Cross-entropy Minimization With Softmax (mentioning)

confidence: 99%

Partial AUC Optimization Based Deep Speaker Embeddings with Class-Center Learning for Text-Independent Speaker Verification

Bai

Zhang

Chen

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Deep embedding based text-independent speaker verification has demonstrated superior performance to traditional methods in many challenging scenarios. Its loss functions can be generally categorized into two classes, i.e., verification and identification. The verification loss functions match the pipeline of speaker verification, but their implementations are difficult. Thus, most state-of-the-art deep embedding methods use the identification loss functions with softmax output units or their variants. In this paper, we propose a verification loss function, named the maximization of partial area under the Receiver-operating-characteristic (ROC) curve (pAUC), for deep embedding based text-independent speaker verification. We also propose a class-center based training trial construction method to improve the training efficiency, which is critical for the proposed loss function to be comparable to the identification loss in performance. Experiments on the Speaker in the Wild (SITW) and NIST SRE 2016 datasets show that the proposed pAUC loss function is highly competitive with the state-of-the-art identification loss functions.
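The abstract above centres on the partial area under the ROC curve (pAUC). As a rough illustration of the quantity being maximised, the following sketch computes a normalised empirical pAUC from target and non-target trial scores. It is not the loss from the citing paper, which relaxes this count into a differentiable objective; the function name `partial_auc`, the parameter names `alpha` and `beta`, and the tie handling are assumptions made for this example.

```python
import math


def partial_auc(pos_scores, neg_scores, alpha=0.0, beta=0.1):
    """Normalised empirical pAUC over the false-positive-rate range [alpha, beta].

    pos_scores: similarity scores of target (same-speaker) trials.
    neg_scores: similarity scores of non-target (different-speaker) trials.
    Names and defaults are illustrative, not taken from the cited work.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one target and one non-target trial")
    if not 0.0 <= alpha < beta <= 1.0:
        raise ValueError("require 0 <= alpha < beta <= 1")

    # The highest-scoring non-target trials define the low-FPR region,
    # so rank them from hardest to easiest.
    ranked = sorted(neg_scores, reverse=True)
    lo = math.floor(alpha * len(ranked))
    hi = math.ceil(beta * len(ranked))
    selected = ranked[lo:hi]  # non-empty because alpha < beta

    # Count correctly ordered (target, non-target) pairs; ties count as half.
    wins = 0.0
    for p in pos_scores:
        for n in selected:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(selected))


if __name__ == "__main__":
    targets = [0.9, 0.8, 0.4]
    non_targets = [0.85, 0.5, 0.3, 0.1]
    print(partial_auc(targets, non_targets, alpha=0.0, beta=0.5))
    print(partial_auc(targets, non_targets, alpha=0.0, beta=1.0))
```

With `alpha=0.0` and `beta=1.0` the function reduces to the ordinary AUC, while a small `beta` restricts the count to the hardest non-target trials, which is the operating region a verification system usually cares about.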



“…ASV is undisputedly a crucial technology for biometric identification, which is broadly applied in real-world applications like banking and home automation. Considerable performance improvements in terms of both accuracy and efficiency of ASV systems have been achieved through active research in a diversity of approaches [1][2][3][4][5][6]. [4] proposed a method that use the Gaussian mixture model to extract acoustic features and then apply the likelihood ratio for scoring.…”

Section: Introduction (mentioning)

confidence: 99%

“…by [5] to improve verification accuracy and make the ASV model compact and efficient.…” [Footnote: Haibin Wu and Hung-yi Lee were supported by the Ministry of Science and Technology of Taiwan.]

Section: Introduction (mentioning)

confidence: 99%

Defense Against Adversarial Attacks on Spoofing Countermeasures of ASV

Liu

Meng

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Various forefront countermeasure methods for automatic speaker verification (ASV) with considerable performance in anti-spoofing are proposed in the ASVspoof 2019 challenge. However, previous work has shown that countermeasure models are vulnerable to adversarial examples indistinguishable from natural data. A good countermeasure model should not only be robust against spoofing audio, including synthetic, converted, and replayed audios; but counteract deliberately generated examples by malicious adversaries. In this work, we introduce a passive defense method, spatial smoothing, and a proactive defense method, adversarial training, to mitigate the vulnerability of ASV spoofing countermeasure models against adversarial examples. This paper is among the first to use defense methods to improve the robustness of ASV spoofing countermeasure models under adversarial attacks. The experimental results show that these two defense methods positively help spoofing countermeasure models counter adversarial examples.


“…Given a test recording, the embedding for this recording is compared against the embeddings generated from the enrolment utterances using a suitable distance metric. Speaker verification algorithms can be characterised based on whether the phonetic content in the inputs is limited, which is known as text-dependent speaker verification [9]. Alternatively, text-independent systems operate with no restrictions on the phonetic content [3].…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Task Learning for Speaker Verification and Voice Trigger Detection

Sigtia

Marchi

Kajarekar

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Automatic speech transcription and speaker recognition are usually treated as separate tasks even though they are interdependent. In this study, we investigate training a single network to perform both tasks jointly. We train the network in a supervised multi-task learning setup, where the speech transcription branch of the network is trained to minimise a phonetic connectionist temporal classification (CTC) loss while the speaker recognition branch of the network is trained to label the input sequence with the correct label for the speaker. We present a large-scale empirical study where the model is trained using several thousand hours of labelled training data for each task. We evaluate the speech transcription branch of the network on a voice trigger detection task while the speaker recognition branch is evaluated on a speaker verification task. Results demonstrate that the network is able to encode both phonetic and speaker information in its learnt representations while yielding accuracies at least as good as the baseline models for each task, with the same number of parameters as the independent models.
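The citation statement for this entry describes comparing a test embedding against embeddings from the enrolment utterances "using a suitable distance metric". A minimal sketch of that scoring step follows, assuming the enrolment embeddings are averaged into a single speaker model and that cosine similarity is the metric. The function names, the toy two-dimensional vectors, and the threshold of 0.7 are hypothetical choices for this example; a real system would tune the threshold on held-out trials.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speaker_model(enrol_embeddings):
    """Average the enrolment embeddings into one speaker model (assumed)."""
    count = len(enrol_embeddings)
    dim = len(enrol_embeddings[0])
    return [sum(e[i] for e in enrol_embeddings) / count for i in range(dim)]


def verify(test_embedding, enrol_embeddings, threshold=0.7):
    """Return (score, accepted); the threshold is a hypothetical placeholder."""
    score = cosine_similarity(test_embedding, speaker_model(enrol_embeddings))
    return score, score >= threshold


if __name__ == "__main__":
    enrolment = [[1.0, 0.0], [0.8, 0.2]]  # toy embeddings of one speaker
    print(verify([1.0, 0.1], enrolment))  # close to the speaker model
    print(verify([0.0, 1.0], enrolment))  # far from the speaker model
```

The same accept/reject decision applies to text-dependent and text-independent systems; they differ in how the embeddings are produced, not in this final comparison.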

