2020
DOI: 10.1016/j.neucom.2020.06.045
Deep multi-metric learning for text-independent speaker verification

Abstract: Text-independent speaker verification is an important artificial intelligence problem that has a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. The purpose of text-independent speaker verification is to determine whether two given uncontrolled utterances originate from the same speaker or not. Extracting speech features for each speaker using deep neural networks is a promising direction to explore, and a straightforward solution is to…
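The verification decision described in the abstract reduces to comparing two fixed-length speaker representations. A minimal NumPy sketch of that comparison is below; the cosine-similarity scoring and the threshold value are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb1: np.ndarray, emb2: np.ndarray, threshold: float = 0.7) -> bool:
    """Accept the trial if the similarity exceeds a tuned threshold.

    The threshold is a placeholder; in practice it is calibrated on a
    development set (e.g. at the equal error rate operating point).
    """
    return cosine_similarity(emb1, emb2) >= threshold
```

In an actual system the two embeddings would come from a trained deep network; here they are simply NumPy vectors.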

Help me understand this report
View preprint versions

Search citation statements

Order By: Relevance

Paper Sections

Select...
4
1

Citation Types

0
8
0

Year Published

2020
2020
2024
2024

Publication Types

Select...
4
3
2

Relationship

0
9

Authors

Journals

Cited by 27 publications (8 citation statements)
References 49 publications (66 reference statements)
“…A ResNet-SE block mainly consists of convolution layers. Filters in a convolution layer explicitly model local features and allow spatial translation invariance, which makes convolution layers suitable for extracting frame-level features [28]. The SE block expands the temporal context of the frame-level information by modeling channel interdependence in the features, which has been verified to be helpful in the speaker verification task [28].…”
Section: Robust One-shot Voice Conversion 3.1 Deep Discriminative Spea…
Citation type: mentioning; confidence: 99%
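The channel recalibration this excerpt describes can be sketched in NumPy. This is a generic squeeze-and-excitation computation under stated assumptions, not the cited paper's implementation; the weight matrices below are random placeholders standing in for learned parameters:

```python
import numpy as np

def se_block(features: np.ndarray, reduction: int = 4) -> np.ndarray:
    """Squeeze-and-excitation over a (channels, time) frame-level feature map.

    The bottleneck weights w1/w2 are random placeholders; in a real
    network they are learned jointly with the rest of the model.
    """
    c, t = features.shape
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    # Squeeze: global average pooling over time collapses each channel
    # to a single descriptor, widening the temporal context.
    z = features.mean(axis=1)                       # shape (c,)
    # Excitation: a bottleneck MLP models channel interdependence.
    s = np.maximum(w1 @ z, 0.0)                     # ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ s)))         # sigmoid gates in (0, 1)
    # Recalibrate: rescale every channel of the frame-level features.
    return features * gates[:, None]
```

Because the gates lie in (0, 1), the block only attenuates channels; the learned weights decide which channels carry speaker-discriminative information.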
“…In this paper, to further improve the effectiveness of the speaker embedding extracted from only one utterance of an unseen speaker, we propose a deep discriminative speaker encoder. Inspired by [28], a residual network and a squeeze-and-excitation network [29] are first integrated to extract discriminative frame-level speaker information by modeling frame-wise and channel-wise interdependence in the features. Then an attention mechanism is introduced to assign different weights to the frame-level speaker information.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
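Weighting frame-level information with attention, as this excerpt describes, can be sketched as softmax-weighted pooling. This is a generic attentive-pooling sketch, not the cited encoder; the scoring vector `w` is an illustrative stand-in for a learned parameter:

```python
import numpy as np

def attentive_pooling(frames: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Pool (time, dim) frame-level features into one utterance-level vector.

    Each frame gets a scalar attention score from the learned vector w
    (a placeholder here); softmax turns scores into weights that sum to 1.
    """
    scores = frames @ w                              # one score per frame
    scores -= scores.max()                           # numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()   # softmax attention weights
    return alphas @ frames                           # weighted sum over frames
```

With a zero scoring vector the weights are uniform and the result degenerates to average pooling, which makes the role of the learned scores easy to see.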
“…Speaker embedding is widely used as a front-end processing step for extracting speaker-discriminative information in speech application systems that need speaker information, for example, speaker verification systems for authenticating security access [1] and speaker diarization systems for real-time meeting recordings and/or dialogs [2], [3]. Owing to the success of deep learning frameworks in speech and image processing, speaker embedding algorithms have been proposed in which the outputs of bottleneck layers can be used as the speaker representation.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
“…Time-delay neural networks (TDNNs) [5,6,7,8,9] and one-dimensional convolutional neural networks (CNNs) along the time axis are representative frame-level structures. Segment-level structures [10,11,12,13,14,15] treat the input acoustic features as a grayscale image with three dimensions for time, frequency, and channel, respectively, and employ a two-dimensional CNN to produce three-dimensional outputs. In a segment-level structure, with the downsampling operation, the time and frequency dimensions shrink while the number of channels increases.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
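The downsampling behavior described in this excerpt can be traced with the standard 2-D convolution output-size formula. The layer stack below (kernel 3, stride 2, padding 1, channel counts 32/64/128) is an illustrative assumption, not the architecture of any cited paper:

```python
def conv2d_output_shape(t: int, f: int, c_out: int,
                        kernel: int = 3, stride: int = 2,
                        padding: int = 1) -> tuple:
    """Output (time, freq, channels) of one strided 2-D convolution layer."""
    t_out = (t + 2 * padding - kernel) // stride + 1
    f_out = (f + 2 * padding - kernel) // stride + 1
    return (t_out, f_out, c_out)

# Trace an illustrative stack: time and frequency shrink at each
# stride-2 layer while the channel dimension grows.
shape = (300, 64)  # 300 frames x 64 mel bins, single input channel
for c_out in (32, 64, 128):
    t, f, c = conv2d_output_shape(shape[0], shape[1], c_out)
    shape = (t, f)
```

Tracing the loop, a 300 x 64 input becomes 150 x 32 x 32, then 75 x 16 x 64, then 38 x 8 x 128, which is exactly the trade-off the excerpt describes.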