Interspeech 2018
DOI: 10.21437/interspeech.2018-1044

Training Utterance-level Embedding Networks for Speaker Identification and Verification

Abstract: Encoding speaker-specific characteristics from speech signals into fixed length vectors is a key component of speaker identification and verification systems. This paper presents a deep neural network architecture for speaker embedding models where similarity in embedded utterance vectors explicitly approximates the similarity in vocal patterns of speakers. The proposed architecture contains an additional speaker embedding lookup table to compute loss based on embedding similarities. Furthermore, we propose a …

Cited by 5 publications (5 citation statements)
References 17 publications
“…In [25], a long short-term memory (LSTM) architecture was applied on MFCC, resulting in an embedding used to verify the speaker of the utterance by means of cosine distance. Other attempts, such as the model proposed in [23] and [31], have used the LSTM architecture as an intermediate tool in extracting i-vectors. More sophisticated models combining CNN and RNN-based solutions were proposed in [26] and [24], applying several convolution layers in between the MFCC input and the RNN.…”
Section: Related Work (mentioning)
confidence: 99%
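The statement above mentions verifying a speaker by the cosine distance between embeddings. As a minimal sketch of that scoring step only (not the cited authors' implementation), the snippet below compares two fixed-length embedding vectors. The function names and the threshold value of 0.7 are hypothetical, chosen for illustration; in practice a threshold would be tuned on held-out verification trials.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(enrolled: Sequence[float], test: Sequence[float],
                 threshold: float = 0.7) -> bool:
    """Accept the trial if the similarity reaches the threshold.

    The default threshold is an assumed placeholder, not a value
    taken from any of the cited papers.
    """
    return cosine_similarity(enrolled, test) >= threshold
```

Cosine distance is simply one minus this similarity, so thresholding either quantity gives the same accept/reject decision.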
“…Recurrent neural networks (RNN) have been utilized in a number of studies. Recently, RNN models were employed in [23,31,25] with mel-frequency cepstrum coefficients (MFCC) as inputs. In [25], a long short-term memory (LSTM) architecture was applied on MFCC, resulting in an embedding used to verify the speaker of the utterance by means of cosine distance.…”
Section: Related Work (mentioning)
confidence: 99%
“…The goal of speaker recognition is to recognize a speaker from the characteristics of voices (Bai, Zhang, & Chen, 0000; Poddar, Sahidullah, & Saha, 2017). Representing the speaker properties into low dimensional feature space is beneficial for many downstream tasks, and such compact representations used to distinguish speakers (speaker embedding) have been an attractive topic and is widely used in some studies, such as speaker identification (Park, Cho, Park, Kim, & Park, 2018), verification (Le & Odobez, 2018; Novoselov, Shulipa, Kremnev, Kozlov, & Shchemelinin, 2018; Snyder, Garcia-Romero, Povey, & Khudanpur, 2017), detection (McLaren, Castan, Nandwana, Ferrer, & Yilmaz, 2018), segmentation (Garcia-Romero, Snyder, Sell, Povey, & McCree, 2017; Wang, Downey, Wan, Mansfield and Moreno, 2018), and speaker dependent speech enhancement (Chuang, Wang, Hung, Tsao, & Fang, 2019; Gao et al., 2015).…”
Section: Introduction (mentioning)
confidence: 99%
“…The generation of compact representation used to distinguish speakers has been an attractive topic and widely used in some related studies, such as speaker identification [1], verification [2,3,4], detection [5], segmentation [6,7], and speaker dependent speech enhancement [8,9].…”
Section: Introduction (mentioning)
confidence: 99%