Investigation of Frame Alignments for GMM-based Digit-prompted Speaker Verification

Liu, Yi; He, Liang; Zhang, Weiqiang; Liu, Jia; Johnson, Michael T.

doi:10.23919/apsipa.2018.8659790

Cited by 4 publications (2 citation statements)

References 23 publications (44 reference statements)


“…Most of machine learning is shallow model (such as GMM, HMM) [23][24] [25], which means weak nonlinear transformation ability, so it is not enough to describe the complex high-dimensional features of speech. Since the input of GMM is a single frame, the influence of copronunciation is ignored, so we use the spliced frame as the input of the neural network to model the observation probability.…”

Section: Backbone (mentioning; confidence: 99%)

A New Approach for Speech Keyword Spotting in Noisy Environment

Ye, Huang

2022

Artificial Intelligence, Soft Computing and Applications


Keyword spotting aims to detect wake-up keywords in a continuous voice stream and is widely used in products such as mobile devices and smart home devices. Recently, DNNs have come to dominate keyword spotting and have dramatically improved performance. However, few researchers have addressed noise in speech keyword recognition. We therefore propose an architecture for detection in noisy scenarios. Our framework combines an attention mechanism and a residual structure on a CNN backbone. In addition, we use separable convolution to reduce the number of model parameters, which makes the model suitable for embedded devices. Noise from various scenes is used for data augmentation to boost performance. The proposed method achieves an accuracy of 94.93% on the noisy test set based on the Google Speech Commands dataset. We also compare the proposed method with an RNN-based algorithm and demonstrate that our model achieves higher accuracy with fewer parameters.
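The abstract credits separable convolution with reducing the parameter count. As a rough illustration only (not the cited model), the sketch below counts weights for a standard k x k convolution and for a depthwise separable one (a depthwise k x k filter per input channel followed by a 1 x 1 pointwise convolution), ignoring biases. The layer sizes in the usage block are hypothetical.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise separable convolution (biases ignored).

    Depthwise: one k x k filter per input channel.
    Pointwise: a 1 x 1 convolution mixing c_in channels into c_out channels.
    """
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 64 input and 64 output channels.
    k, c_in, c_out = 3, 64, 64
    std = standard_conv_params(k, c_in, c_out)
    sep = separable_conv_params(k, c_in, c_out)
    print(std, sep, round(std / sep, 2))
```

For this hypothetical layer the counts are 36,864 versus 4,672 weights, roughly an 8x reduction; in general the ratio of separable to standard weights is 1/c_out + 1/k².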

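The quoted passage contrasts a GMM, which scores one frame at a time, with a neural network that takes spliced frames as input so that neighbouring-frame context is visible. A minimal NumPy sketch of frame splicing follows; the symmetric context window and edge-repeat padding are assumptions for illustration, not details taken from the cited work.

```python
import numpy as np


def splice_frames(features: np.ndarray, context: int) -> np.ndarray:
    """Stack each frame with `context` neighbouring frames on each side.

    features: (num_frames, feat_dim) per-frame features (e.g. MFCCs).
    Edge frames are padded by repeating the first/last frame (an assumption).
    Returns an array of shape (num_frames, (2 * context + 1) * feat_dim).
    """
    num_frames, _ = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Row t of the output covers original frames t - context .. t + context.
    return np.stack(
        [padded[t : t + 2 * context + 1].reshape(-1) for t in range(num_frames)]
    )


if __name__ == "__main__":
    feats = np.arange(12, dtype=float).reshape(6, 2)  # 6 frames, 2-dim features
    spliced = splice_frames(feats, context=2)
    print(spliced.shape)  # (6, 10)
```

With a context of 2, each network input covers five frames, so the centre frame is seen together with its two predecessors and two successors.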


“…The text-dependent SV task allows us to compare utterances of the same phonetic context [6], [7], or random word sequences coming from a fixed vocabulary [8], [9]. With random sequences, such as random digit strings, an SV system is less vulnerable to replay attacks [8], [10], [11]. In this work, we study a neural acoustic-phonetic approach for SV of random digit strings in RSR2015 Part III database [12].…”

Section: Introduction (mentioning; confidence: 99%)

Neural Acoustic-Phonetic Approach for Speaker Verification With Phonetic Attention Mask

Liu, Das, Lee, et al. 2022

IEEE Signal Process. Lett.


The traditional acoustic-phonetic approach makes use of both spectral and phonetic information when comparing the voices of speakers. While phonetic units are not equally informative, the phonetic context of speech plays an important role in speaker verification (SV). In this paper, we propose a neural acoustic-phonetic approach that learns to dynamically assign differentiated weights to spectral features for SV. Such differentiated weights form a phonetic attention mask (PAM). The neural acoustic-phonetic framework consists of two training pipelines, one for SV and another for speech recognition. Through the PAM, we leverage phonetic information for SV. We evaluate the proposed framework on Part III of the RSR2015 database, which consists of random digit strings. We show that the proposed framework with PAM consistently outperforms the baseline, with equal error rate reductions of 13.45% and 10.20% for female and male data, respectively.
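The abstract describes a phonetic attention mask as differentiated weights applied to spectral features. The sketch below shows one generic way such a mask could be applied: an element-wise sigmoid gate computed from frame-level phonetic features through a linear projection. The projection, the sigmoid gate, and all names here are hypothetical; this is not the cited paper's PAM architecture or training procedure.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def apply_phonetic_mask(spectral, phonetic, weight, bias):
    """Gate spectral features with a mask derived from phonetic features.

    spectral: (T, D) frame-level spectral features
    phonetic: (T, K) frame-level phonetic features (e.g. ASR bottleneck outputs)
    weight:   (K, D) hypothetical projection matrix; bias: (D,)
    Returns the masked features (T, D) and the mask (T, D), values in (0, 1).
    """
    mask = sigmoid(phonetic @ weight + bias)
    return spectral * mask, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, K = 5, 4, 3
    spectral = rng.normal(size=(T, D))
    phonetic = rng.normal(size=(T, K))
    weight = rng.normal(size=(K, D))
    bias = np.zeros(D)
    masked, mask = apply_phonetic_mask(spectral, phonetic, weight, bias)
    print(masked.shape, bool(mask.min() > 0), bool(mask.max() < 1))
```

Because the mask varies per frame and per feature dimension, frames whose phonetic content is judged more speaker-informative can pass more of their spectral information through.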

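The abstract reports its gains as equal error rate (EER) reductions. For reference, a minimal sketch of computing EER from verification scores: sweep the decision threshold over the observed scores and take the point where the false-acceptance and false-rejection rates are closest. Evaluation toolkits often interpolate along the DET/ROC curve, so their values can differ slightly; the scores in the usage block are made up.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores) -> float:
    """Approximate EER by sweeping thresholds over all observed scores.

    target_scores: scores for same-speaker trials (should be accepted).
    nontarget_scores: scores for different-speaker trials (should be rejected).
    A trial is accepted when its score is >= the threshold.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for thr in np.sort(np.concatenate([target, nontarget])):
        frr = np.mean(target < thr)      # false rejections of target trials
        far = np.mean(nontarget >= thr)  # false acceptances of non-target trials
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)


if __name__ == "__main__":
    targets = [0.4, 0.6, 0.8, 0.9]     # made-up same-speaker scores
    nontargets = [0.1, 0.3, 0.5, 0.7]  # made-up different-speaker scores
    print(equal_error_rate(targets, nontargets))
```

In this made-up example, a threshold of 0.6 rejects one of four target trials and accepts one of four non-target trials, so the EER is 25%.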