Interspeech 2019
DOI: 10.21437/Interspeech.2019-2713

State-of-the-Art Speaker Recognition for Telephone and Video Speech: The JHU-MIT Submission for NIST SRE18


Cited by 78 publications (60 citation statements)
References 13 publications
“…Then, we explain the transfer learning approach followed to perform SER. It has been shown in the literature that speaker embeddings such as i-vectors [17] are useful for tasks such as speaker diarization [18,19,20,21]. In this work, we exploit only the x-vector model, because of its superiority over i-vectors [22] and because it is easy to adapt for downstream tasks.…”
Section: Our Approach
confidence: 99%
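
As a rough illustration of the transfer-learning recipe in the excerpt above (not code from the cited work), the sketch below freezes a pretrained embedding encoder and trains only a small emotion classifier on top. The encoder stand-in, the feature dimensions, and the four-class label set are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained x-vector encoder; in practice this would be a
# model trained for speaker recognition (the 80-dim input features and
# 256-dim embedding are assumptions for illustration).
encoder = nn.Sequential(
    nn.Conv1d(80, 256, kernel_size=5),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
)
for p in encoder.parameters():
    p.requires_grad = False          # freeze the speaker-trained weights

# Small emotion classifier trained on top of the frozen embeddings;
# the four-class output is a hypothetical label set.
emotion_head = nn.Sequential(
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 4),
)
optimizer = torch.optim.Adam(emotion_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(feats, labels):
    # feats: (batch, 80, frames) filterbank features; labels: (batch,)
    with torch.no_grad():
        emb = encoder(feats)         # fixed-dimensional utterance embedding
    loss = loss_fn(emotion_head(emb), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```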
“…In this paper, we used the state-of-the-art ResNet x-vector model reported in [17] for utterance-level speaker embedding extraction. The network consisted of three parts: a frame-level representation learning network, a pooling network, and an utterance-level classifier.…”
Section: X-vector Model
confidence: 99%
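
The three-part structure described in this excerpt can be sketched as follows. This is a minimal illustrative model, not the cited ResNet configuration; all layer sizes and the TDNN-style frame-level network are assumptions.

```python
import torch
import torch.nn as nn

class XVectorSketch(nn.Module):
    """Frame-level encoder -> statistical pooling -> utterance classifier."""

    def __init__(self, n_mels=80, embed_dim=256, n_speakers=1000):
        super().__init__()
        # Frame-level representation learning (TDNN-style 1-D convolutions).
        self.frame_net = nn.Sequential(
            nn.Conv1d(n_mels, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Utterance-level layers: embedding, then speaker classifier.
        self.embedding = nn.Linear(2 * 1500, embed_dim)
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, feats):            # feats: (batch, n_mels, frames)
        h = self.frame_net(feats)        # (batch, 1500, frames')
        # Statistical pooling: concatenate per-channel mean and std.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.embedding(stats)      # fixed-dimensional x-vector
        return self.classifier(emb), emb
```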
“…There are three components in a typical end-to-end speaker recognition system: an encoder network, a statistical pooling layer, and a classifier [16]. An encoder network acts as a frame-level feature extractor, the statistical pooling layer summarizes frame-level representations to a fixed-dimensional utterance-level embedding, and the classifier determines the speaker identity based on the embedding.…”
Section: Neural Speaker Embeddings
confidence: 99%
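
A minimal sketch of the statistical pooling step described above, assuming mean-and-standard-deviation pooling (one common choice). It shows the key property the excerpt names: utterances of different lengths map to embeddings of the same size.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Summarize variable-length frame-level features into one vector."""

    def forward(self, x):                # x: (batch, channels, frames)
        mean = x.mean(dim=2)
        std = x.std(dim=2)
        return torch.cat([mean, std], dim=1)   # (batch, 2 * channels)

# Two utterances of different lengths yield the same output dimension.
pool = StatsPooling()
print(pool(torch.randn(1, 512, 300)).shape)    # torch.Size([1, 1024])
print(pool(torch.randn(1, 512, 57)).shape)     # torch.Size([1, 1024])
```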
“…One aspect of our study is therefore an attempt to find out how effective these recent developments in speaker verification are for speaker adaptation in TTS. More specifically, we investigate the capability of neural speaker embeddings [16,17,19] to capture and model characteristics of speakers that were unseen during TTS model training. For this purpose, we extend the improved Tacotron system in [28] to a multi-speaker TTS system and conduct a systematic analysis to answer the above question.…”
Section: Introduction
confidence: 99%
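
One simple way such a speaker embedding can condition a sequence-to-sequence TTS model is to broadcast it across time and concatenate it to the text-encoder outputs. The sketch below illustrates that idea only; it is not the cited Tacotron extension, and all shapes are assumptions.

```python
import torch

def condition_on_speaker(encoder_out, spk_emb):
    # encoder_out: (batch, text_steps, enc_dim); spk_emb: (batch, spk_dim)
    # Repeat the utterance-level speaker embedding at every text step.
    spk = spk_emb.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
    return torch.cat([encoder_out, spk], dim=-1)

out = condition_on_speaker(torch.randn(2, 40, 512), torch.randn(2, 256))
print(out.shape)   # torch.Size([2, 40, 768])
```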
“…Text-independent speaker verification aims to verify whether an utterance was pronounced by a hypothesized speaker, according to his/her pre-recorded utterances, without limiting the speech content. State-of-the-art text-independent speaker verification systems [1-4] use deep neural networks (DNNs) to project speech recordings of different lengths into a common low-dimensional embedding space in which the speakers' identities are represented. Such a method is called deep embedding, where the embedding networks have three key components: network structure [1,3,5-7], pooling layer [1,8-12], and loss function [13-17].…”
Section: Introduction
confidence: 99%
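
For the loss-function component mentioned in this excerpt, additive-margin softmax (AM-Softmax) is one widely used choice. The sketch below is illustrative only; the margin and scale values are typical defaults, not taken from the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmax(nn.Module):
    """Additive-margin softmax loss over speaker-class weights."""

    def __init__(self, embed_dim, n_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, embed_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and weights.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        # Subtract the margin from the target-class cosine only, then scale.
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.scale * (cos - self.margin * onehot)
        return F.cross_entropy(logits, labels)
```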