FrAUG: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals

Ravi, Vijay; Wang, Jinhan; Flint, Jonathan; Alwan, Abeer

doi:10.48550/arxiv.2202.05912

Cited by 1 publication

(1 citation statement)

References 21 publications

“…Among others, acoustic features such as x-vectors [18], i-vectors [19] and other speaker embeddings [20] have been shown to be effective in the diagnosis of a speaker's mental state. These features, however, also carry information about a speaker's identity [21], which can be counter-productive to privacy preservation, a key factor in the adoption of digital mental-health screening systems [22].…”

Section: Introduction (mentioning)

confidence: 99%

A Step Towards Preserving Speakers' Identity While Detecting Depression Via Speaker Disentanglement

Ravi, Wang, Flint et al. 2022

Preprint

Self Cite

Preserving a patient's identity is a challenge for automatic, speech-based diagnosis of mental health disorders. In this paper, we address this issue by proposing adversarial disentanglement of depression characteristics and speaker identity. The model used for depression classification is trained in a speaker-identity-invariant manner by minimizing depression prediction loss and maximizing speaker prediction loss during training. The effectiveness of the proposed method is demonstrated on two datasets, DAIC-WOZ (English) and CONVERGE (Mandarin), with three feature sets (Mel-spectrograms, raw-audio signals, and the last-hidden-state of Wav2vec2.0), using a modified DepAudioNet model. With adversarial training, depression classification improves for every feature when compared to the baseline. Wav2vec2.0 features with adversarial learning resulted in the best performance (F1-score of 69.2% for DAIC-WOZ and 91.5% for CONVERGE). Analysis of the class-separability measure (J-ratio) of the hidden states of the DepAudioNet model shows that when adversarial learning is applied, the backend model loses some speaker-discriminability while it improves depression-discriminability. These results indicate that there are some components of speaker identity that may not be useful for depression detection and minimizing their effects provides a more accurate diagnosis of the underlying disorder and can safeguard a speaker's identity.
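
The training objective described in the abstract (minimize depression prediction loss while maximizing speaker prediction loss) can be illustrated with a minimal sketch. The abstract does not state how the two losses are combined, so the simple weighted subtraction below, the weight `lam`, and the function names are assumptions made for illustration only, not the authors' implementation.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under predicted probabilities."""
    return -math.log(probs[label])


def adversarial_objective(dep_probs, dep_label, spk_probs, spk_label, lam=0.1):
    """Quantity to minimize: depression loss minus a weighted speaker loss.

    Minimizing this value lowers the depression loss and raises the speaker
    loss. `lam` is a hypothetical trade-off weight, not a value from the paper.
    """
    dep_loss = cross_entropy(dep_probs, dep_label)
    spk_loss = cross_entropy(spk_probs, spk_label)
    return dep_loss - lam * spk_loss


# Same depression prediction, two different speaker predictions (4 speakers).
confident_speaker = adversarial_objective([0.2, 0.8], 1, [0.97, 0.01, 0.01, 0.01], 0)
confused_speaker = adversarial_objective([0.2, 0.8], 1, [0.25, 0.25, 0.25, 0.25], 0)
print(confused_speaker < confident_speaker)
```

Under this assumed formulation, a representation from which the speaker is harder to identify yields a lower objective for the same depression prediction, which is the direction of the speaker-invariance described in the abstract.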
