jameslyons/python_speech_features: release v0.6.1

Lyons, J. R.; Wang, Darren Yow-Bang; Gianluca; Shteingart, Hanan; Mavrinac, Erik; Gaurkar, Yash; Watcharawisetkul, Watcharapol; Birch, S.; Zhihe, Lu; Hölzl, J.; Lesinskis, Janis; Almer, H.E.; Lord, Christopher J.; Stark, Adam

doi:10.5281/zenodo.3607820

Cited by 16 publications

(6 citation statements)

References 0 publications

“…where $N_{fb}$ is the total number of filters (usually 40) and $N_{mfcc}$ is the number of selected coefficients (usually 13). Among the most notable are Slaney's Auditory toolbox [31], Voicebox for MATLAB [32], and James Lyon's Python_speech_features GitHub resources [33]. In this study, we used Auditory Toolbox for MFCC and Mel filter bank computation.…”

Section: Human Speech Models

mentioning

confidence: 99%
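The quoted statement defines the number of filters and the number of retained coefficients, but the formula it refers to is elided. As a minimal sketch, assuming the elided step is the usual DCT of log filter-bank energies, the relationship between the two counts can be illustrated as follows. The function names, the unnormalized DCT-II scaling, and the placeholder energies are assumptions made for illustration; they are not taken from the cited paper or from any of the toolboxes it lists.

```python
import math


def dct_ii(values):
    """Unnormalized DCT-II of a sequence (scaling conventions vary by library)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def cepstral_coefficients(log_energies, n_mfcc=13):
    """Keep the first n_mfcc DCT coefficients of the log filter-bank energies."""
    return dct_ii(log_energies)[:n_mfcc]


n_fb = 40  # number of filters, the "usually 40" from the quote
# Placeholder log energies for one frame; real values come from a filter bank.
log_energies = [math.log(1.0 + i) for i in range(n_fb)]
coeffs = cepstral_coefficients(log_energies)  # 13 values for this frame
```

Each frame therefore shrinks from 40 filter-bank values to 13 coefficients; the discarded higher-order terms describe fine spectral detail.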

“…Many libraries have been developed to extract the speech features for Mel, MFCC, PLP, LPC, and other filter banks. Among the most notable are Slaney's Auditory toolbox [31], Voicebox for MATLAB [32], and James Lyon's Python_speech_features GitHub resources [33]. In this study, we used Auditory Toolbox for MFCC and Mel filter bank computation. The human vocal tract can be simulated using formulas of air flow inside tubes with some simplifications.…”

mentioning

confidence: 99%

A novel filter bank design for speech emotion recognition

Parlak

Altun

2022

Preprint

In this study, a novel filter bank design is proposed for speech emotion recognition to replace current state-of-the-art MFCC (Mel Filter Cepstral Coefficients) and Mel filter banks. These novel filter banks are considered to have a great impact and pave the way for great developments and improvements over speech emotion recognition applications. Many filter banks have been proposed to model speech recognition applications but these models either contain too many banks or need some cumbersome mathematical operations to compute. MFCC requires the calculation of DCT (Discrete Cosine Transform), and it is also too difficult to interpret the MFCC coefficients. Mel filters are easy to interpret but they contain too many filters. The novel filter banks are faster and easier to compute. Moreover, they can be interpreted better compared to the MFCC and Mel filters. We apply these filter banks with NVIDIA’s CNN model and SVM-SMO classifier to compare them with MFCC and Mel filter banks. We also implement feature selection, data augmentation, and various techniques to combat problems of imbalanced datasets to show the effectiveness of proposed filter banks.

“…Many libraries have been developed to extract the speech features for Mel, MFCC, PLP, LPC, and other filter banks. Among the most notable are Slaney's Auditory toolbox [32], Voicebox for MATLAB [33], and James Lyon's Python_speech_features GitHub resources [34]. In this study, we used Auditory Toolbox for MFCC and Mel filter bank computation.…”

Section: Human Speech Models

mentioning

confidence: 99%

A novel filter bank design for speech emotion recognition

Parlak

Altun

2022

Preprint

In this study, a novel filter bank design is proposed for speech emotion recognition to replace current state-of-the-art MFCC (Mel Filter Cepstral Coefficients) and Mel filter banks. These novel filter banks are considered to have a great impact and pave the way for great developments and improvements over speech emotion recognition applications. Many filter banks have been proposed to model speech recognition applications but these models either contain too many banks or need some cumbersome mathematical operations to compute. MFCC requires the calculation of DCT (Discrete Cosine Transform), and it is also too difficult to interpret the MFCC coefficients. Mel filters are easy to interpret but they contain too many filters. The novel filter banks are faster and easier to compute. Moreover, they can be interpreted better compared to the MFCC and Mel filters. We apply these filter banks with NVIDIA’s CNN and ResNet deep convolutional networks. We also implement feature selection, data augmentation, and various techniques to combat problems of imbalanced datasets to show the effectiveness of proposed filter banks.

“…From each audio recording, we extracted 13 mel-frequency cepstral coefficients (MFCC 0-12) with a window length of 25 ms and step size of 10 ms using the python_speech_features library [34]. Mel-frequency cepstral coefficients (MFCCs) have been widely used in both speaker recognition [35], SER [36], and depression detection [37], and have several desirable properties such as being independent of the energy of the acoustic signal and robustness across genders [38, 39]. MFCCs represent movements of the vocal tract and are designed to mimic how the human ear perceives sounds by having high resolution in the lower frequencies and less in higher frequencies.…”

Section: Feature Extraction

mentioning

confidence: 99%
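The window and step sizes quoted above determine how many feature vectors a recording yields. The sketch below works through that arithmetic in plain Python. The 16 kHz sample rate and the `frame_count` helper are assumptions made for illustration (the quote does not state a sample rate), and the tail of the signal is simply dropped here, whereas libraries differ in whether they pad the final partial frame.

```python
def frame_count(num_samples, sample_rate_hz, win_ms=25, step_ms=10):
    """Number of complete analysis frames, with no padding of the tail."""
    win_len = win_ms * sample_rate_hz // 1000    # samples per window
    step_len = step_ms * sample_rate_hz // 1000  # samples between window starts
    if num_samples < win_len:
        return 0
    return 1 + (num_samples - win_len) // step_len


rate = 16000  # assumed sample rate; not given in the quoted statement
frames = frame_count(num_samples=rate, sample_rate_hz=rate)  # one second of audio
feature_shape = (frames, 13)  # 13 coefficients (MFCC 0-12) per frame
```

Under these assumptions a 25 ms window spans 400 samples and a 10 ms step spans 160 samples, so consecutive frames overlap by 15 ms.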

A generalizable speech emotion recognition model reveals depression and remission

Hansen

Zhang

Wolf

et al.

2021

Acta Psychiatr Scand

Objective: Affective disorders are associated with atypical voice patterns; however, automated voice analyses suffer from small sample sizes and untested generalizability on external data. We investigated a generalizable approach to aid clinical evaluation of depression and remission from voice using transfer learning: We train machine learning models on easily accessible non-clinical datasets and test them on novel clinical data in a different language. Methods: A Mixture of Experts machine learning model was trained to infer happy/sad emotional state using three publicly available emotional speech corpora in German and US English. We examined the model's predictive ability to classify the presence of depression on Danish-speaking healthy controls (N = 42), patients with first-episode major depressive disorder (MDD) (N = 40), and the subset of the same patients who entered remission (N = 25) based on recorded clinical interviews. The model was evaluated on raw, de-noised, and speaker-diarized data. Results: The model showed separation between healthy controls and depressed patients at the first visit, obtaining an AUC of 0.71. Further, speech from patients in remission was indistinguishable from that of the control group. Model predictions were stable throughout the interview, suggesting that 20-30 s of speech might be enough to accurately screen a patient. Background noise (but not speaker diarization) heavily impacted predictions. Conclusion: A generalizable speech emotion recognition model can effectively reveal changes in speaker depressive states before and after remission in patients with MDD. Data collection settings and data cleaning are crucial when considering automated voice analysis for clinical purposes.
