Training Utterance-level Embedding Networks for Speaker Identification and Verification

Park, Heewoong; Cho, Sukhyun; Park, Kyubyong; Kim, Namju; Park, Jonghun

doi:10.21437/interspeech.2018-1044

Cited by 5 publications

(5 citation statements)

References 17 publications

“…In [25], a long short-term memory (LSTM) architecture was applied on MFCC, resulting in an embedding used to verify the speaker of the utterance by means of cosine distance. Other attempts, such as the model proposed in [23] and [31], have used the LSTM architecture as an intermediate tool in extracting i-vectors. More sophisticated models combining CNN and RNN-based solutions were proposed in [26] and [24], applying several convolution layers in between the MFCC input and the RNN.…”

Section: Related Work (mentioning)

confidence: 99%
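
The citation statement above describes verifying a speaker by comparing utterance embeddings with cosine distance. A minimal sketch of that scoring step follows; the function names, the example vectors, and the 0.7 threshold are illustrative assumptions and are not taken from the cited papers. The code computes cosine similarity, and the corresponding distance is one minus that value.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(enrolled, test, threshold=0.7):
    """Accept the trial when similarity reaches the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out trials (for example, at the equal error rate point).
    """
    return cosine_similarity(enrolled, test) >= threshold
```

For example, `same_speaker([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])` accepts the trial because the two vectors point in the same direction, while orthogonal vectors are rejected.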

“…Recurrent neural networks (RNN) have been utilized in a number of studies. Recently, RNN models were employed in [23,31,25] with mel-frequency cepstrum coefficients (MFCC) as inputs. In [25], a long short-term memory (LSTM) architecture was applied on MFCC, resulting in an embedding used to verify the speaker of the utterance by means of cosine distance.…”

Section: Related Work (mentioning)

confidence: 99%

“…Emergence of challenging datasets such as Speaker In The Wild (SITW) [11] and its extended variations such as VoxCeleb1 and VoxCeleb2 [1,3], with more than 5,000 speakers and one million utterances, have enabled the opportunity to tackle speaker recognition in real-world scenarios. As a result, a number of deep learning solutions have been proposed for this purpose, including the models studied in [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28].…”

Section: Introduction (mentioning)

confidence: 99%

A Deep Neural Network for Short-Segment Speaker Recognition

Hajavi

Etemad

2019

Interspeech 2019

Today's interactive devices such as smart-phone assistants and smart speakers often deal with short-duration speech segments. As a result, speaker recognition systems integrated into such devices will be much better suited with models capable of performing the recognition task with short-duration utterances. In this paper, a new deep neural network, UtterIdNet, capable of performing speaker recognition with short speech segments is proposed. Our proposed model utilizes a novel architecture that makes it suitable for short-segment speaker recognition through an efficiently increased use of information in short speech segments. UtterIdNet has been trained and tested on the VoxCeleb datasets, the latest benchmarks in speaker recognition. Evaluations for different segment durations show consistent and stable performance for short segments, with significant improvement over the previous models for segments of 2 seconds, 1 second, and especially sub-second durations (250 ms and 500 ms).

“…The goal of speaker recognition is to recognize a speaker from the characteristics of voices (Bai, Zhang, & Chen, 0000;Poddar, Sahidullah, & Saha, 2017). Representing the speaker properties into low dimensional feature space is beneficial for many downstream tasks, and such compact representations used to distinguish speakers (speaker embedding) have been an attractive topic and is widely used in some studies, such as speaker identification (Park, Cho, Park, Kim, & Park, 2018), verification (Le & Odobez, 2018;Novoselov, Shulipa, Kremnev, Kozlov, & Shchemelinin, 2018;Snyder, Garcia-Romero, Povey, & Khudanpur, 2017), detection (McLaren, Castan, Nandwana, Ferrer, & Yilmaz, 2018), segmentation (Garcia-Romero, Snyder, Sell, Povey, & McCree, 2017;Wang, Downey, Wan, Mansfield and Moreno, 2018), and speaker dependent speech enhancement (Chuang, Wang, Hung, Tsao, & Fang, 2019;Gao et al, 2015).…”

Section: Introduction (mentioning)

confidence: 99%

H-VECTORS: Improving the robustness in utterance-level speaker embeddings using a hierarchical attention model

2021

“…The generation of compact representation used to distinguish speakers has been an attractive topic and widely used in some related studies, such as speaker identification [1], verification [2,3,4], detection [5], segmentation [6,7], and speaker dependent speech enhancement [8,9].…”

Section: Introduction (mentioning)

confidence: 99%

H-Vectors: Utterance-Level Speaker Embedding Using a Hierarchical Attention Model

Shi

Huang

Hain

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this paper, a hierarchical attention network to generate utterance-level embeddings (H-vectors) for speaker identification is proposed. Since different parts of an utterance may have different contributions to speaker identities, the use of hierarchical structure aims to learn speaker related information locally and globally. In the proposed approach, frame-level encoder and attention are applied on segments of an input utterance and generate individual segment vectors. Then, segment level attention is applied on the segment vectors to construct an utterance representation. To evaluate the effectiveness of the proposed approach, NIST SRE 2008 Part1 dataset is used for training, and two datasets, Switchboard Cellular part1 and CallHome American English Speech, are used to evaluate the quality of extracted utterance embeddings on speaker identification and verification tasks. In comparison with two baselines, X-vector, X-vector+Attention, the obtained results show that H-vectors can achieve a significantly better performance. Furthermore, the extracted utterance-level embeddings are more discriminative than the two baselines when mapped into a 2D space using t-SNE.
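
The two-level pooling described in this abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' model: it omits the frame-level encoder, uses plain dot-product attention against fixed query vectors, and every name here (`attention_pool`, `hierarchical_pool`, `frame_query`, `segment_query`, `segment_len`) is a hypothetical choice made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weighted average of vectors; weights are softmax(dot(vector, query))."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def hierarchical_pool(frames, segment_len, frame_query, segment_query):
    """Pool frames into segment vectors, then segment vectors into one vector."""
    # Split the utterance into consecutive, non-overlapping segments
    # (an assumption; the last segment may be shorter than segment_len).
    segments = [frames[i:i + segment_len]
                for i in range(0, len(frames), segment_len)]
    # Local step: frame-level attention inside each segment.
    segment_vectors = [attention_pool(seg, frame_query) for seg in segments]
    # Global step: segment-level attention across the segment vectors.
    return attention_pool(segment_vectors, segment_query)
```

With all-zero queries every attention weight is uniform, so the result reduces to a mean of segment means; learned queries would instead emphasize the frames and segments that carry more speaker information.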
