2020
DOI: 10.1016/j.neucom.2020.06.045
Deep multi-metric learning for text-independent speaker verification

Abstract: Text-independent speaker verification is an important artificial intelligence problem that has a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. The purpose of text-independent speaker verification is to determine whether two given uncontrolled utterances originate from the same speaker or not. Extracting speech features for each speaker using deep neural networks is a promising direction to explore, and a straightforward solution is to…
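The verification decision described in the abstract reduces to comparing two fixed-length speaker representations. A minimal NumPy sketch of that comparison is below; the cosine-similarity scoring and the threshold value are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb1: np.ndarray, emb2: np.ndarray, threshold: float = 0.7) -> bool:
    """Accept the trial if the similarity exceeds a tuned threshold.

    The threshold is a placeholder; in practice it is calibrated on a
    development set (e.g. at the equal error rate operating point).
    """
    return cosine_similarity(emb1, emb2) >= threshold
```

In an actual system the two embeddings would come from a trained deep network; here they are simply NumPy vectors.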

Help me understand this report
View preprint versions

Search citation statements

Order By: Relevance

Paper Sections

Select...
4
1

Citation Types

0
8
0

Year Published

2020
2020
2024
2024

Publication Types

Select...
4
3
2

Relationship

0
9

Authors

Journals

Cited by 27 publications (8 citation statements)
References 49 publications (66 reference statements)
“…A ResNet-SE block mainly consists of convolution layers. Filters in a convolution layer explicitly model local features and allow spatial translation invariance, which makes convolution layers suitable for extracting frame-level features [28]. The SE block expands the temporal context of the frame-level information by modeling channel interdependence in the features, which has been verified to be helpful in the speaker verification task [28].…”
Section: Robust One-shot Voice Conversion 3.1 Deep Discriminative Spea…
Citation type: mentioning; confidence: 99%
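The channel recalibration this excerpt describes can be sketched in NumPy. This is a generic squeeze-and-excitation computation under stated assumptions, not the cited paper's implementation; the weight matrices below are random placeholders standing in for learned parameters:

```python
import numpy as np

def se_block(features: np.ndarray, reduction: int = 4) -> np.ndarray:
    """Squeeze-and-excitation over a (channels, time) frame-level feature map.

    The bottleneck weights w1/w2 are random placeholders; in a real
    network they are learned jointly with the rest of the model.
    """
    c, t = features.shape
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    # Squeeze: global average pooling over time collapses each channel
    # to a single descriptor, widening the temporal context.
    z = features.mean(axis=1)                       # shape (c,)
    # Excitation: a bottleneck MLP models channel interdependence.
    s = np.maximum(w1 @ z, 0.0)                     # ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ s)))         # sigmoid gates in (0, 1)
    # Recalibrate: rescale every channel of the frame-level features.
    return features * gates[:, None]
```

Because the gates lie in (0, 1), the block only attenuates channels; the learned weights decide which channels carry speaker-discriminative information.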
“…In this paper, to further improve the effectiveness of the speaker embedding extracted from only one utterance of an unseen speaker, we propose a deep discriminative speaker encoder. Inspired by [28], a residual network and a squeeze-and-excitation network [29] are first integrated to extract discriminative frame-level speaker information by modeling frame-wise and channel-wise interdependence in the features. Then an attention mechanism is introduced to assign different weights to the frame-level speaker information.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
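Weighting frame-level information with attention, as this excerpt describes, can be sketched as softmax-weighted pooling. This is a generic attentive-pooling sketch, not the cited encoder; the scoring vector `w` is an illustrative stand-in for a learned parameter:

```python
import numpy as np

def attentive_pooling(frames: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Pool (time, dim) frame-level features into one utterance-level vector.

    Each frame gets a scalar attention score from the learned vector w
    (a placeholder here); softmax turns scores into weights that sum to 1.
    """
    scores = frames @ w                              # one score per frame
    scores -= scores.max()                           # numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()   # softmax attention weights
    return alphas @ frames                           # weighted sum over frames
```

With a zero scoring vector the weights are uniform and the result degenerates to average pooling, which makes the role of the learned scores easy to see.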
“…Speaker embedding is widely used as a front-end processing step for extracting speaker-discriminative information in speech application systems that need speaker information, for example, speaker verification systems for authenticating security access [1] and speaker diarization systems for real-time meeting recordings and/or dialogs [2], [3]. Owing to the success of deep learning frameworks in speech and image processing, speaker embedding algorithms have been proposed in which the outputs of bottleneck layers can be used as the speaker representation.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
“…Time-delay neural networks (TDNNs) [5,6,7,8,9] and one-dimensional convolutional neural networks (CNNs) along the time axis are representative frame-level structures. Segment-level structures [10,11,12,13,14,15] treat the input acoustic features as a grayscale image with three dimensions for time, frequency, and channel, respectively, and employ a two-dimensional CNN to produce three-dimensional outputs. In a segment-level structure, with the downsampling operation, the time and frequency dimensions shrink while the number of channels increases.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
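The downsampling behavior described in this excerpt can be traced with the standard 2-D convolution output-size formula. The layer stack below (kernel 3, stride 2, padding 1, channel counts 32/64/128) is an illustrative assumption, not the architecture of any cited paper:

```python
def conv2d_output_shape(t: int, f: int, c_out: int,
                        kernel: int = 3, stride: int = 2,
                        padding: int = 1) -> tuple:
    """Output (time, freq, channels) of one strided 2-D convolution layer."""
    t_out = (t + 2 * padding - kernel) // stride + 1
    f_out = (f + 2 * padding - kernel) // stride + 1
    return (t_out, f_out, c_out)

# Trace an illustrative stack: time and frequency shrink at each
# stride-2 layer while the channel dimension grows.
shape = (300, 64)  # 300 frames x 64 mel bins, single input channel
for c_out in (32, 64, 128):
    t, f, c = conv2d_output_shape(shape[0], shape[1], c_out)
    shape = (t, f)
```

Tracing the loop, a 300 x 64 input becomes 150 x 32 x 32, then 75 x 16 x 64, then 38 x 8 x 128, which is exactly the trade-off the excerpt describes.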