Speaker Verification Using End-to-end Adversarial Language Adaptation

Rohdin, Johan; Stafylakis, Themos; Silnova, Anna; Zeinali, Hossein; Burget, Lukáš; Plchot, Oldřich

doi:10.1109/icassp.2019.8683616

Cited by 40 publications

(23 citation statements)

References 5 publications


“…Utilizing adversarial networks has also been explored for speaker recognition. Proposed solutions in [32,33,27,27,28] explore the use of adversarial networks and generative adversarial networks both as discriminative models for verification of the speaker as well as generative models. Such generative models were mostly used to transform the conditions of the utterance into more convenient environments in which to perform the speaker recognition task.…”

Section: Related Work (mentioning)

confidence: 99%

A Deep Neural Network for Short-Segment Speaker Recognition

Hajavi, Etemad

2019

Interspeech 2019


Today's interactive devices such as smart-phone assistants and smart speakers often deal with short-duration speech segments. As a result, speaker recognition systems integrated into such devices will be much better suited with models capable of performing the recognition task with short-duration utterances. In this paper, a new deep neural network, UtterIdNet, capable of performing speaker recognition with short speech segments is proposed. Our proposed model utilizes a novel architecture that makes it suitable for short-segment speaker recognition through an efficiently increased use of information in short speech segments. UtterIdNet has been trained and tested on the VoxCeleb datasets, the latest benchmarks in speaker recognition. Evaluations for different segment durations show consistent and stable performance for short segments, with significant improvement over the previous models for segments of 2 seconds, 1 second, and especially sub-second durations (250 ms and 500 ms).

“…Our approach of conditioning the reconstruction on the estimated phone sequence of each segment can be employed, enabling such approaches to be revisited in an end-to-end fashion. Other recent approaches aiming at enhancing the x-vector architecture with adversarial loss are also relevant, since they propose joint training of the network with auxiliary losses and structures which are removed in runtime [8,9,10].…”

Section: Related Work, 2.1 Speaker Recognition Using Autoencoders (mentioning)

confidence: 99%

Self-Supervised Speaker Embeddings

et al. 2019

Self Cite


Contrary to i-vectors, speaker embeddings such as x-vectors are incapable of leveraging unlabelled utterances, due to the classification loss over training speakers. In this paper, we explore an alternative training strategy to enable the use of unlabelled utterances in training. We propose to train speaker embedding extractors via reconstructing the frames of a target speech segment, given the inferred embedding of another speech segment of the same utterance. We do this by attaching to the standard speaker embedding extractor a decoder network, which we feed not merely with the speaker embedding, but also with the estimated phone sequence of the target frame sequence. The reconstruction loss can be used either as a single objective, or be combined with the standard speaker classification loss. In the latter case, it acts as a regularizer, encouraging generalizability to speakers unseen during training. In all cases, the proposed architectures are trained from scratch and in an end-to-end fashion. We demonstrate the benefits from the proposed approach on VoxCeleb and Speakers in the Wild, and we report notable improvements over the baseline.


“…To improve the performance of x-vectors, recently proposed methods for applying domain adaptation to the x-vector extractor (e.g. using Generative Adversarial Networks [41], [42]) are worth exploring, in order to reduce the mismatch in channel and accent between VoxCeleb and RSR2015.…”

Section: Comparison With X-vector (mentioning)

confidence: 99%

Speaker Recognition With Random Digit Strings Using Uncertainty Normalized HMM-Based i-Vectors

Maghsoodi, Sameti, Zeinali

et al. 2019

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite


In this paper, we combine Hidden Markov Models (HMMs) with i-vector extractors to address the problem of text-dependent speaker recognition with random digit strings. We employ digit-specific HMMs to segment the utterances into digits, to perform frame alignment to HMM states and to extract Baum-Welch statistics. By making use of the natural partition of input features into digits, we train digit-specific i-vector extractors on top of each HMM and we extract well-localized i-vectors, each modelling merely the phonetic content corresponding to a single digit. We then examine ways to perform channel and uncertainty compensation, and we propose a novel method for using the uncertainty in the i-vector estimates. The experiments on RSR2015 Part III show that the proposed method attains 1.52% and 1.77% Equal Error Rate (EER) for male and female speakers, respectively, outperforming state-of-the-art methods such as x-vectors, trained on vast amounts of data. Furthermore, these results are attained by a single system trained entirely on RSR2015, and by a simple score-normalized cosine distance. Moreover, we show that the omission of channel compensation yields only a minor degradation in performance, meaning that the system attains state-of-the-art results even without recordings from multiple handsets per speaker for training or enrolment. Similar conclusions are drawn from our experiments on the RedDots corpus, where the same method is evaluated on phrases. Finally, we report results with bottleneck features and show that further improvement is attained when fusing them with spectral features.
