Effectiveness of energy separation-based instantaneous frequency estimation for cochlear cepstral features for synthetic and voice-converted spoofed speech detection

Patil, Ankur T.; Patil, Hemant A.; Khoria, Kuldeep

doi:10.1016/j.csl.2021.101301

Cited by 6 publications

(1 citation statement)

References 69 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Many feature sets have been proposed with statistical and deep learning-based classifiers. A few widely used feature sets are as follows: Mel frequency cepstrum coefficients (MFCCs); inverse MFCCs (IMFCCs) [ 15 ]; linear frequency cepstrum coefficients (LFCCs); constant Q cepstrum coefficients (CQCCs) [ 16 ]; log-power spectrum using discrete Fourier transform (DFT) [ 17 ]; Gammatonegram, group delay over the frame, referred to as GD-gram [ 18 ]; modified group delay; All-Pole Group Delay [ 19 ]; Cochlear Filter Cepstral Coefficient—Instantaneous Frequency [ 20 ]; cepstrum coefficients using single-frequency filtering [ 21 , 22 ]; Zero-Time Windowing (ZTW) [ 23 ]; Mel-frequency cepstrum using ZTW [ 24 ]; and polyphase IIR filters [ 25 ]. The human ear uses Fourier transform magnitude and neglects the phase information [ 26 ].…”

Section: Related Workmentioning

confidence: 99%

Gaussian-Filtered High-Frequency-Feature Trained Optimized BiLSTM Network for Spoofed-Speech Classification

Mewada,

Al-Asad,

Almalki

et al. 2023

Sensors

View full text Add to dashboard Cite

Voice-controlled devices are in demand due to their hands-free controls. However, using voice-controlled devices in sensitive scenarios like smartphone applications and financial transactions requires protection against fraudulent attacks referred to as “speech spoofing”. The algorithms used in spoof attacks are practically unknown; hence, further analysis and development of spoof-detection models for improving spoof classification are required. A study of the spoofed-speech spectrum suggests that high-frequency features are able to discriminate genuine speech from spoofed speech well. Typically, linear or triangular filter banks are used to obtain high-frequency features. However, a Gaussian filter can extract more global information than a triangular filter. In addition, MFCC features are preferable among other speech features because of their lower covariance. Therefore, in this study, the use of a Gaussian filter is proposed for the extraction of inverted MFCC (iMFCC) features, providing high-frequency features. Complementary features are integrated with iMFCC to strengthen the features that aid in the discrimination of spoof speech. Deep learning has been proven to be efficient in classification applications, but the selection of its hyper-parameters and architecture is crucial and directly affects performance. Therefore, a Bayesian algorithm is used to optimize the BiLSTM network. Thus, in this study, we build a high-frequency-based optimized BiLSTM network to classify the spoofed-speech signal, and we present an extensive investigation using the ASVSpoof 2017 dataset. The optimized BiLSTM model is successfully trained with the least epoch and achieved a 99.58% validation accuracy. The proposed algorithm achieved a 6.58% EER on the evaluation dataset, with a relative improvement of 78% on a baseline spoof-identification system.

show abstract