A robust BFCC feature extraction for ASR system

Kuan, Ta-Wen; Tsai, An-Chao; Sung, Po-Hsun; Wang, Jhing-Fa; Kuo, Hsien-Shun

doi:10.5430/air.v5n2p14

Cited by 7 publications

(3 citation statements)

References 20 publications

“…In [31], authors used GTCC feature and pitch at front-end for feature extraction, and passed these features to GMM and KNN to improve the performance of ASV system. Kaun et al [10] applied auditory based BFCC features with AURORA 2 dataset, and compared these features' performance with MFCC using HMM model. Noroozi et al, [32] suggested a methodology for emotion recognition using audio.…”

Section: A Related Work (mentioning)

confidence: 99%

Spoof Detection using Sequentially Integrated Image and Audio Features

Chakravarty, Dua

2023

IJCDS

Analyzing the intricate nature of an audio signal often requires the extraction of relevant features, which serve as informative descriptors of the signal. It entails studying the signal and determining how signals are related to one another. As a result, the performance of audio spoofing detection in Automatic Speaker Verification (ASV) systems is strongly reliant on front-end feature extraction. In this paper, three types of successively integrated features have been proposed. First, Acoustic Ternary Pattern (ATP) image features are sequentially fused with different audio features such as MFCC, CQCC, GTCC, BFCC and PLP, individually. Second, LBP image features are combined with all these audio features similarly. Then, the sequential integration of ATP-LBP features is combined individually with MFCC, CQCC, GTCC, BFCC and PLP features. Finally, these front-end hybrid feature sets are classified using different ML and deep learning algorithms based acoustic models at the back-end. The state-of-the-art ASVspoof 2019 dataset has been used to implement various front-end and back-end combinations. The research outcomes reveal that the proposed approach achieved the best results with ATP-LBP-GTCC at the front end with LSTM-based acoustic model at the back-end.

“…Hence, researchers tried to modify these techniques to make these noise robust. The other approach to handle the noise during feature extraction is to use features that are already noise robust such as GTCC [8], [9] and BFCC [10], [11]. GTCC employs a non-linear gammatone filter bank [12].…”

Section: Introduction (mentioning)

confidence: 99%

Spoof Detection using Sequentially Integrated Image and Audio Features

Chakravarty, Dua

2023

IJCDS

“…The proposed feature set contains many attributes computed in time and frequency domain. The feature space includes the energy of the signal, fundamental frequency (F0) (Boersma, Paul, 1993; Boersma, Weenink, 2001), linear prediction coefficients (LPC) (Markel, Gray, 1976), linear predictive cepstral coefficients (LPCC) (Rao et al, 2015), Mel frequency cepstral coefficients (MFCC) (Davis, Mermelstein, 1980), and bark frequency cepstral coefficients (BFCC) (Kuan et al, 2016). The selection of fundamental frequency for the whole spoken sentence seems to be the most promising part of the feature space.…”

Section: Feature Space Analysis (mentioning)

confidence: 99%

Archives of Acoustics

Smietanka

Maka

2021

An analysis of low-level feature space for emotion recognition from the speech is presented. The main goal was to determine how the statistical properties computed from contours of low-level features influence the emotion recognition from speech signals. We have conducted several experiments to reduce and tune our initial feature set and to configure the classification stage. In the process of analysis of the audio feature space, we have employed the univariate feature selection using the chi-squared test. Then, in the first stage of classification, a default set of parameters was selected for every classifier. For the classifier that obtained the best results with the default settings, the hyperparameter tuning using cross-validation was exploited. In the result, we compared the classification results for two different languages to find out the difference between emotional states expressed in spoken sentences. The results show that from an initial feature set containing 3198 attributes we have obtained the dimensionality reduction about 80% using feature selection algorithm. The most dominant attributes selected at this stage based on the mel and bark frequency scales filterbanks with its variability described mainly by variance, median absolute deviation and standard and average deviations. Finally, the classification accuracy using tuned SVM classifier was equal to 72.5% and 88.27% for emotional spoken sentences in Polish and German languages, respectively.
