Proceedings of the 9th International Audio/Visual Emotion Challenge and Workshop (AVEC 2019)
DOI: 10.1145/3347320.3357694
Multimodal Fusion of BERT-CNN and Gated CNN Representations for Depression Detection

Abstract: Depression is a common but serious mental disorder that affects people all over the world. Besides providing an easier way of diagnosing the disorder, a computer-aided automatic depression assessment system is in demand to reduce subjective bias in the diagnosis. We propose a multimodal fusion of speech and linguistic representations for depression detection. We train our model to infer the Patient Health Questionnaire (PHQ) score of subjects from the AVEC 2019 DDS Challenge database, the E-DAIC corpus. Fo…
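To make the fusion idea concrete, the following is a minimal PyTorch sketch of concatenation-based fusion of a text embedding and a speech embedding into a PHQ-score regressor; all layer sizes and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LateFusionRegressor(nn.Module):
    """Concatenate per-modality embeddings and regress a single PHQ score."""
    def __init__(self, text_dim: int = 768, speech_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + speech_dim, hidden),  # joint projection of both modalities
            nn.ReLU(),
            nn.Linear(hidden, 1),                      # scalar PHQ-score prediction
        )

    def forward(self, text_emb: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
        return self.fusion(torch.cat([text_emb, speech_emb], dim=-1)).squeeze(-1)

# Example: a BERT-sized text embedding (768-d) fused with a hypothetical 128-d speech embedding
model = LateFusionRegressor()
scores = model(torch.randn(8, 768), torch.randn(8, 128))
print(scores.shape)  # torch.Size([8])
```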

Cited by 84 publications (47 citation statements) · References 32 publications
“…For instance, the authors in [ 13 ] used the bag-of-words model to encode audio and visual features and then fused them to perform multi-modal learning for depression detection. Rodrigues Makiuchi et al. [ 14 ] used texts generated from the original speech audio by Google Cloud’s speech recognition service, extracted hidden embeddings from a pretrained BERT [ 15 ] model, and concatenated all modalities, achieving a concordance correlation coefficient (CCC) score of 0.69 on the AVEC 2019 DDS Challenge dataset. Aside from audio, video, and text modalities, the method proposed in [ 16 ] employed body gestures as one of the modalities to perform early fusion.…”
Section: Related Work
confidence: 99%
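The concordance correlation coefficient (CCC) reported above combines correlation with agreement in mean and scale. A minimal NumPy sketch follows, with illustrative PHQ arrays that are not data from the paper:

```python
import numpy as np

def concordance_ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient (CCC) between two 1-D arrays."""
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()  # population variances
    cov = ((y_true - mean_true) * (y_pred - mean_pred)).mean()
    return 2.0 * cov / (var_true + var_pred + (mean_true - mean_pred) ** 2)

# Hypothetical reference and predicted PHQ scores
phq_ref = np.array([4.0, 10.0, 15.0, 7.0, 20.0])
phq_hat = np.array([5.0, 9.0, 14.0, 8.0, 18.0])
print(f"CCC = {concordance_ccc(phq_ref, phq_hat):.3f}")
```

Unlike Pearson correlation, CCC is penalized when predictions are systematically shifted or rescaled relative to the reference, which is why it is the standard metric for the AVEC regression challenges.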
“…Attempts to use external structures for this (such as visual indexes [31] or ontologies [32]) lead to significant losses in context, which in many cases diminishes the benefits of multimodal fusion. Therefore, approaches that use features preserving contextual domain dependencies dominate recent publications [18,19,21-23], with deep learning methods as the technological base.…”
Section: Background and Related Work
confidence: 99%
“…In semantic processing of medical texts, contextual word embeddings, primarily BERT [38], which consists of multiple transformer layers using the self-attention mechanism, show the best results [40,42,43]. For example, to fuse text and speech for depression detection, [19] extracted features with BERT-CNN and VGG-16 CNN models combined with a Gated Convolutional Neural Network (GCNN) followed by an LSTM layer. Additionally, [42] shows that BERT outperforms traditional word embedding methods in feature extraction tasks, and that BERT pre-trained on clinical texts performs better than BERT pre-trained on general-domain texts.…”
Section: Background and Related Work
confidence: 99%
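For readers unfamiliar with gated convolutions, below is a minimal PyTorch sketch of a gated CNN block followed by an LSTM, in the spirit of the GCNN-LSTM pipeline described above; the channel sizes, kernel width, and regression head are illustrative assumptions, not the configuration from [19].

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """One gated convolution (GLU) block: conv output elementwise-gated by a sigmoid branch."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, padding=pad)  # linear branch
        self.gate = nn.Conv1d(in_ch, out_ch, kernel, padding=pad)  # gating branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, time)
        return self.conv(x) * torch.sigmoid(self.gate(x))

class GCNNLSTM(nn.Module):
    """Gated CNN feature extractor followed by an LSTM, ending in a PHQ-score regressor."""
    def __init__(self, in_ch: int = 40, hidden: int = 128):
        super().__init__()
        self.gcnn = nn.Sequential(GatedConvBlock(in_ch, hidden), GatedConvBlock(hidden, hidden))
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # feats: (batch, time, in_ch)
        h = self.gcnn(feats.transpose(1, 2)).transpose(1, 2)  # back to (batch, time, hidden)
        _, (h_n, _) = self.lstm(h)
        return self.head(h_n[-1]).squeeze(-1)  # one score per sequence

# Example: 4 sequences of 200 frames with 40 acoustic features each
model = GCNNLSTM()
print(model(torch.randn(4, 200, 40)).shape)  # torch.Size([4])
```

The sigmoid branch acts as a learned gate that can suppress uninformative frames before the LSTM summarizes the sequence.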