Supervector Extraction for Encoding Speaker and Phrase Information with Neural Networks for Text-Dependent Speaker Verification

Mingote, Victoria; Miguel, Antonio; Giménez, Alfonso Ortega; Lleida, Eduardo

doi:10.3390/app9163295

Cited by 11 publications

(12 citation statements)

References 24 publications

“…As we showed in our previous work [5], keeping the order of the phonetic information is important for text-dependent tasks due to the lexical content, since this information is part of the identity. DNN models using standard average pooling mechanisms to transform the processed utterance information to an embedding vector can have problems for this task.…”

Section: Introduction (mentioning)

confidence: 79%

“…The multi-head attention layers can be seen analogous to an alignment method which allows assigning embeddings to several categories. This approach has been found useful for text-dependent tasks [5,19]. In addition, we introduce in our architecture memory layers that can store a significant amount of information for a relatively small inference computing cost.…”

Section: Mean (mentioning)

confidence: 99%

Memory Layers with Multi-Head Attention Mechanisms for Text-Dependent Speaker Verification

Mingote, Miguel, Giménez et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

In this paper, we explore an approach based on memory layers and multi-head attention mechanisms to efficiently improve the performance of text-dependent speaker verification (SV) systems. The most widespread SV systems based on Deep Neural Networks (DNN) extract the utterance embedding by average pooling over the temporal dimension after processing. Unlike previous works, we can exploit the phonetic knowledge needed for text-dependent SV systems by combining the temporal attention of multiple parallel heads with the phonetic embeddings extracted from a phonetic classification network, which help to guide the attention mechanism in the role of the positional embedding. The addition of a memory layer to a text-dependent SV system was tested on the RSR2015-part II and DeepMine-part I databases, where in both cases it outperformed the baseline result and the reference system based on the same transformer network without the memory layer.

“…Speech classification and identification is a research area that has consistently been highly represented in IberSPEECH conferences, including the 2018 edition. This special issue includes two papers from ViVoLab [4,5]. In these works, the authors investigate how to use phonetic information for the tasks of short sentence verification and text-independent speaker verification using Neural Networks.…”

Section: Speaker Verification and Identification (mentioning)

confidence: 99%

“…Mingote et al [5] propose an architecture to include phonetic information in text-dependent verification systems using Neural Networks. DNN has improved significantly many speaker verification tasks.…”

Section: Speaker Verification and Identification (mentioning)

confidence: 99%

Editorial for Special Issue “IberSPEECH2018: Speech and Language Technologies for Iberian Languages”

2020

The main goal of this Special Issue is to present the latest advances in research and novel applications of speech and language technologies based on the works presented at the IberSPEECH edition held in Barcelona in 2018, paying special attention to those focused on Iberian languages. IberSPEECH is the international conference of the Special Interest Group on Iberian Languages (SIG-IL) of the International Speech Communication Association (ISCA) and of the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla, or RTTH for short). Several researchers were invited to extend their contributions presented at IberSPEECH2018 due to their interest and quality. As a result, this Special Issue is composed of 13 papers that cover different topics of investigation related to perception, speech analysis and enhancement, speaker verification and identification, speech production and synthesis, natural language processing, together with several applications and evaluation challenges.

“…They applied convolutional neural networks (CNNs) to the front end and learn the neural network that produces super-vectors of each word, whose pronunciation and syntax are distinguished at the same time. This choice has the advantage that super-vectors encode phrases and speaker information, which showed good performance in text-dependent speaker verification tasks [11].…”

Section: Related Work (mentioning)

confidence: 99%

An Audification and Visualization System (AVS) of an Autonomous Vehicle for Blind and Deaf People Based on Deep Learning

Son, Jeong, Lee 2019

Sensors

When blind and deaf people are passengers in fully autonomous vehicles, an intuitive and accurate visualization screen should be provided for the deaf, and an audification system with speech-to-text (STT) and text-to-speech (TTS) functions should be provided for the blind. However, these systems cannot know the fault self-diagnosis information and the instrument cluster information that indicates the current state of the vehicle when driving. This paper proposes an audification and visualization system (AVS) of an autonomous vehicle for blind and deaf people based on deep learning to solve this problem. The AVS consists of three modules. The data collection and management module (DCMM) stores and manages the data collected from the vehicle. The audification conversion module (ACM) has a speech-to-text submodule (STS) that recognizes a user’s speech and converts it to text data, and a text-to-wave submodule (TWS) that converts text data to voice. The data visualization module (DVM) visualizes the collected sensor data, fault self-diagnosis data, etc., and places the visualized data according to the size of the vehicle’s display. The experiment shows that the time taken to adjust visualization graphic components in on-board diagnostics (OBD) was approximately 2.5 times faster than the time taken in a cloud server. In addition, the overall computational time of the AVS system was approximately 2 ms faster than the existing instrument cluster. Therefore, because the AVS proposed in this paper can enable blind and deaf people to select only what they want to hear and see, it reduces the overload of transmission and greatly increases the safety of the vehicle. If the AVS is introduced in a real vehicle, it can prevent accidents for disabled and other passengers in advance.
