“…Previous studies in depression prediction using speech [15,16] have shown the superiority of MFCCs over other audio-based features such as the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [7] and DEEP SPECTRUM features [1]. Huang et al. [9] showed in their depression classification study that coordination features computed from MFCCs perform better than those computed from formants and eGeMAPS features. So, to compare how robust and effective the TVs are for detecting schizophrenia, we chose MFCCs as the baseline audio features for our study.…”
“…Huang et al. [9], in a recent study with MDD, introduce a new channel-delay correlation method inspired by TDEC, which uses a different correlation structure with correlations computed for delays from 0 to 'D' frames (D is a design choice). The delayed autocorrelations and cross-correlations across channels are stacked to form the FVTC correlation structure.…”
Section: Full Vocal Tract Coordination (FVTC)
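The channel-delay correlation structure described in the statement above can be sketched as follows. This is a minimal illustration, not the implementation from [9] or from the citing paper: it assumes the structure holds, for every ordered pair of channels (i, j) and every delay d from 0 to D, the Pearson correlation between channel i and channel j shifted by d frames. The function names `pearson` and `channel_delay_correlation`, the parameter `max_delay` (standing in for D), and the nested-list layout `corr[i][j][d]` are all assumptions made for this example.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def channel_delay_correlation(channels, max_delay):
    """Return corr[i][j][d]: correlation of channel i with channel j delayed by d frames.

    channels:  list of equal-length feature tracks (hypothetically one per TV or MFCC).
    max_delay: stands in for the design parameter D; delays run over 0..D inclusive.
    """
    n_frames = len(channels[0])
    if any(len(c) != n_frames for c in channels):
        raise ValueError("all channels must have the same number of frames")
    if not 0 <= max_delay < n_frames - 1:
        raise ValueError("max_delay must leave at least two overlapping frames")
    corr = []
    for ci in channels:
        rows = []
        for cj in channels:
            # Pair frame t of channel i with frame t + d of channel j.
            rows.append([
                pearson(ci[: n_frames - d], cj[d:])
                for d in range(max_delay + 1)
            ])
        corr.append(rows)
    return corr


# Usage on synthetic data: 3 channels, 200 frames, delays 0..5.
random.seed(0)
tracks = [[random.gauss(0, 1) for _ in range(200)] for _ in range(3)]
fvtc = channel_delay_correlation(tracks, max_delay=5)
```

Here the entries with i = j are the delayed autocorrelations and the entries with i ≠ j are the delayed cross-correlations; stacking them gives a channels × channels × (D + 1) array that a CNN could take as input. How the real method orders, normalizes, or dilates the delays is not stated in the excerpt, so those details are left out.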
“…We designed a CNN model inspired by the one in [9] which takes the FVTC correlation matrix computed in section 3.2 as the input.…”
Section: FVTC CNN Model (FVTC-CNN): Model
“…Time-delay embedded correlation (TDEC) analysis has shown promising results in assessing neuromotor coordination in Major Depressive Disorder (MDD), and the eigenspectra derived from the correlation matrices have been used effectively for classification of MDD subjects from healthy [17,22,24]. Recently, new multi-scale full vocal tract coordination (FVTC) features generated with a dilated CNN have shown further improvement in classification for selected datasets of MDD subjects [9]. The FVTC method addresses repetitive sampling and matrix discontinuity issues of TDEC analysis by introducing a new channel-delay correlation matrix.…”
This study investigates speech articulatory coordination in schizophrenia subjects exhibiting strong positive symptoms (e.g., hallucinations and delusions), using two distinct channel-delay correlation methods. We show that schizophrenic subjects who have strong positive symptoms and are markedly ill exhibit more complex articulatory coordination patterns in facial and speech gestures than those observed in healthy subjects. This distinction in speech coordination patterns is used to train a multimodal convolutional neural network (CNN) that uses video and audio data recorded during speech to distinguish schizophrenic patients with strong positive symptoms from healthy subjects. We also show that the vocal tract variables (TVs), which correspond to place of articulation and glottal source, outperform Mel-frequency cepstral coefficients (MFCCs) when fused with Facial Action Units (FAUs) in the proposed multimodal network. For the clinical dataset we collected, our best-performing multimodal network improves the mean F1 score for detecting schizophrenia by around 18% compared with the full vocal tract coordination (FVTC) baseline method implemented by fusing FAUs and MFCCs.
CCS CONCEPTS: • Computing methodologies → Neural networks; • Social and professional topics → People with disabilities.
“…Sample features include voice quality [17] [16], articulation [18] [19] [20], speech rate [19], and spectral [9] features. Advances in deep learning [21] have led to improved results in a range of affective and behavioral health tasks [22][23] [24][25] [26][27] [28]. In deep learning the focus is to learn feature representation from data.…”
Speech-based algorithms have gained interest for the management of behavioral health conditions such as depression. We explore a speech-based transfer learning approach that uses a lightweight encoder and that transfers only the encoder weights, enabling a simplified run-time model. Our study uses a large data set containing roughly two orders of magnitude more speakers and sessions than used in prior work. The large data set enables reliable estimation of improvement from transfer learning. Results for the prediction of PHQ-8 labels show up to 27% relative performance gains for binary classification; these gains are statistically significant with a p-value close to zero. Improvements were also found for regression. Additionally, the gain from transfer learning does not appear to require strong source task performance. Results suggest that this approach is flexible and offers promise for efficient implementation.