How many Mel‐frequency cepstral coefficients to be utilized in speech recognition? A study with the Bengali language

Hasan, Md. Rakibul; Hasan, Md. Mahbub; Hossain, Zakir

doi:10.1049/tje2.12082

Cited by 16 publications

(9 citation statements)

References 36 publications


“…Inspired by biological neurons, spiking neuron networks (SNNs) are very popular in deep learning (DL). As a widely used neuronal model in SNNs, the Hodgkin-Huxley (HH) model describes the electrical behavior of giant squid axon membranes, and some biological spiking neuron models are based on it [8]. To solve the problem of computationally overloaded HH neuron model, leaky integrate-firing (LIF), regular spikes (RS, also called Izhikevich model), and other neuron models have been proposed.…”

Section: Introduction (mentioning)

confidence: 99%

Use Brain-Like Audio Features to Improve Speech Recognition Performance

Wang, Zhang. 2022. Journal of Sensors


Speech recognition plays an important role in the field of human-computer interaction through the use of acoustic sensors, but speech recognition is technically difficult, has complex overall logic, relies heavily on neural network algorithms, and has extremely high technical requirements. In speech recognition, feature extraction is the first step in speech recognition for recovering and extracting speech features. Existing methods, such as Meier spectral coefficients (MFCC) and spectrograms, lose a large amount of acoustic information and lack biological interpretability. Then, for example, existing speech self-supervised representation learning methods based on contrast prediction need to construct a large number of negative samples during training, and their learning effects depend on large batches of training, which requires a large amount of computational resources for the problem. Therefore, in this paper, we propose a new feature extraction method, called SHH (spike-H), that resembles the human brain and achieves higher speech recognition rates than previous methods. The features extracted using the proposed model are subsequently fed into the classification model. We propose a novel parallel CRNN model with an attention mechanism that considers both temporal and spatial features. Experimental results show that the proposed CRNN achieves an accuracy of 94.8% on the Aurora dataset. In addition, audio similarity experiments show that SHH can better distinguish audio features. In addition, the ablation experiments show that SHH is applicable to digital speech recognition.



“…This was then processed using a Hamming Window (20) of length 882, followed by Matlab's AudioFeatureExtractor function to determine the first 18 Mel-Frequency Cepstral Coefficients (MFCCs), which are representations of the power spectrum of the sound (21), for each group. Standard numbers of MFCCs used in similar studies vary between 13 and 25 (22). The number 18 was chosen here in order to match the number of features contributed from each IMU sensor, to ensure the system does not initially weight any one sensor more heavily than the others (weights will be determined and refined during training).…”

Section: Processing (mentioning)

confidence: 99%

A modular, deep learning-based holistic intent sensing system tested with Parkinson’s disease patients and controls

Russell, Inches, Carroll et al. 2023. Front. Neurol.


People living with mobility-limiting conditions such as Parkinson’s disease can struggle to physically complete intended tasks. Intent-sensing technology can measure and even predict these intended tasks, such that assistive technology could help a user to safely complete them. In prior research, algorithmic systems have been proposed, developed and tested for measuring user intent through a Probabilistic Sensor Network, allowing multiple sensors to be dynamically combined in a modular fashion. A time-segmented deep-learning system has also been presented to predict intent continuously. This study combines these principles, and so proposes, develops and tests a novel algorithm for multi-modal intent sensing, combining measurements from IMU sensors with those from a microphone and interpreting the outputs using time-segmented deep learning. It is tested on a new data set consisting of a mix of non-disabled control volunteers and participants with Parkinson’s disease, and used to classify three activities of daily living as quickly and accurately as possible. Results showed intent could be determined with an accuracy of 97.4% within 0.5 s of inception of the idea to act, which subsequently improved monotonically to a maximum of 99.9918% over the course of the activity. This evidence supports the conclusion that intent sensing is viable as a potential input for assistive medical devices.
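The preprocessing named in the citation statement for this paper (a Hamming window of length 882, then the first 18 MFCCs) can be illustrated with a minimal, dependency-free sketch. It is not the authors' pipeline, which used a Matlab feature extractor. The 44.1 kHz sample rate, the 40 mel filters, the unnormalized DCT, and every function name below are assumptions made for illustration only.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame):
    """One-sided power spectrum via a naive DFT (slow, but needs no libraries)."""
    n = len(frame)
    twiddle = [cmath.exp(-2j * math.pi * k / n) for k in range(n)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * twiddle[(k * i) % n] for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, frame_len, sample_rate):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist."""
    n_bins = frame_len // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies, expressed as fractional DFT bin positions.
    edges = [
        mel_to_hz(max_mel * i / (n_filters + 1)) * frame_len / sample_rate
        for i in range(n_filters + 2)
    ]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))   # rising slope
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values, n_out):
    """First n_out coefficients of an unnormalized DCT-II."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_out)
    ]


def mfcc_frame(frame, sample_rate, n_mfcc=18, n_filters=40):
    """MFCCs of a single frame: window, power spectrum, mel energies, log, DCT."""
    window = hamming(len(frame))
    spectrum = power_spectrum([x * w for x, w in zip(frame, window)])
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # A small constant keeps the logarithm finite for near-silent bands.
    log_energies = [
        math.log(sum(w * p for w, p in zip(row, spectrum)) + 1e-10) for row in bank
    ]
    return dct2(log_energies, n_mfcc)


# Usage: one 882-sample frame of a 440 Hz tone at the assumed 44.1 kHz rate.
SAMPLE_RATE = 44100  # assumption; the citing paper's rate is not stated here
frame = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(882)]
coeffs = mfcc_frame(frame, SAMPLE_RATE)
print(len(coeffs))  # 18
```

Changing `n_mfcc` is the only step needed to compare other coefficient counts, such as the 13 to 25 range mentioned in the statement. A production system would normally rely on an FFT-based library rather than the naive DFT used here.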


“…Studies on the identification of emotions from Bangla speech data are scarce [4], [24]- [27]. 25 MFCCs were suggested by researchers who investigated the optimum number of MFCCs for emotion recognition in speech data in [4].…”

Section: Motivation (mentioning)

confidence: 99%

A Machine Learning Approach for Emotion Classification in Bengali Speech

Islam, Akhi, Akter et al. 2023. IJACSA


In this research work, we have presented a machine learning strategy for Bengali speech emotion categorization with a focus on Mel-frequency cepstral coefficients (MFCC) as features. The commonly utilized method of MFCC in speech processing has proved effective in obtaining crucial phoneme-specific data. This paper analyzes the efficacy of four machine learning algorithms: Random Forest, XGBoost, CatBoost, and Gradient Boosting, and tackles the paucity of research on emotion categorization in non-English languages, particularly Bengali. With CatBoost obtaining the greatest accuracy of 82.85%, Gradient Boosting coming in second with 81.19%, XGBoost coming in third with 80.03%, and Random Forest coming in fourth with 80.01%, experimental evaluation shows encouraging outcomes. MFCC features improve classification precision and offer insightful information on the distinctive qualities of emotions expressed in Bengali speech. By demonstrating how well MFCC characteristics can identify emotions in Bengali speech, this study advances the field of emotion classification. Future research can investigate more sophisticated feature extraction methods, look into how temporal dynamics are incorporated into emotion classification models, and investigate practical uses for emotion detection systems in Bengali speech. This study advances our knowledge of emotion classification and paves the way for more effective emotion identification systems in Bengali speech by utilizing MFCC and machine learning techniques. Our work addresses the need for thorough and efficient techniques to recognize and classify emotions in speech signals in the context of emotion categorization. Understanding emotions is essential for many applications, as they are a basic component of human communication. By investigating cutting-edge strategies that show promise for enhancing the precision and effectiveness of emotion recognition, this study advances the field of emotion classification.

