2019
DOI: 10.3390/app9163295

Supervector Extraction for Encoding Speaker and Phrase Information with Neural Networks for Text-Dependent Speaker Verification

Abstract: In this paper, we propose a new differentiable neural network with an alignment mechanism for text-dependent speaker verification. Unlike previous works, we do not extract the embedding of an utterance by global average pooling over the temporal dimension. Our system replaces this reduction mechanism with a phonetic phrase alignment model that keeps the temporal structure of each phrase, since the phonetic information is relevant to the verification task. Moreover, we can apply a convolutional neural network as …
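
The contrast the abstract draws between global average pooling and an alignment-based supervector can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and a uniform split of the utterance into equal segments stands in for the phonetic phrase alignment model, whose details are not given in the text above.

```python
from typing import List

Frame = List[float]


def global_average_pooling(frames: List[Frame]) -> Frame:
    """Average all frames over time; the order of the frames is lost."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def uniform_alignment(num_frames: int, num_states: int) -> List[int]:
    """Assumed stand-in for a phrase alignment: split time into equal segments."""
    return [min(t * num_states // num_frames, num_states - 1)
            for t in range(num_frames)]


def supervector(frames: List[Frame], alignment: List[int],
                num_states: int) -> Frame:
    """Average the frames assigned to each state, then concatenate the
    per-state means in state order, so temporal structure is kept."""
    dim = len(frames[0])
    sums = [[0.0] * dim for _ in range(num_states)]
    counts = [0] * num_states
    for frame, state in zip(frames, alignment):
        counts[state] += 1
        for d in range(dim):
            sums[state][d] += frame[d]
    out: Frame = []
    for s in range(num_states):
        c = max(counts[s], 1)  # avoid division by zero for empty states
        out.extend(v / c for v in sums[s])
    return out


# Toy utterance (4 frames, 2 features) and the same frames in reverse order.
utt = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
rev = list(reversed(utt))
align = uniform_alignment(len(utt), 2)

pooled_utt = global_average_pooling(utt)
pooled_rev = global_average_pooling(rev)
sv_utt = supervector(utt, align, 2)
sv_rev = supervector(rev, align, 2)
```

In this toy case the two pooled vectors are identical, while the two supervectors differ, which is the property that matters when the lexical content of the phrase is part of what is being verified.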

Citations: cited by 11 publications (12 citation statements)
References: 24 publications
“…As we showed in our previous work [5], keeping the order of the phonetic information is important for text-dependent tasks due to the lexical content, since this information is part of the identity. DNN models using standard average pooling mechanisms to transform the processed utterance information to an embedding vector can have problems for this task.…”
Section: Introduction
confidence: 79%
“…The multi-head attention layers can be seen as analogous to an alignment method which allows assigning embeddings to several categories. This approach has been found useful for text-dependent tasks [5,19]. In addition, we introduce in our architecture memory layers that can store a significant amount of information for a relatively small inference computing cost.…”
Section: Mean
confidence: 99%
“…Speech classification and identification is a research area that has consistently been highly represented in IberSPEECH conferences, including the 2018 edition. This special issue includes two papers from ViVoLab [4,5]. In these works, the authors investigate how to use phonetic information for the tasks of short sentence verification and text-independent speaker verification using Neural Networks.…”
Section: Speaker Verification and Identification
confidence: 99%
“…Mingote et al. [5] propose an architecture to include phonetic information in text-dependent verification systems using Neural Networks. DNNs have significantly improved many speaker verification tasks.…”
Section: Speaker Verification and Identification
confidence: 99%
“…They applied convolutional neural networks (CNNs) to the front end and trained a neural network that produces a super-vector for each word, distinguishing pronunciation and syntax at the same time. This choice has the advantage that the super-vectors encode both phrase and speaker information, which showed good performance in text-dependent speaker verification tasks [11].…”
Section: Related Work
confidence: 99%