Proceedings of the 9th International Audio/Visual Emotion Challenge and Workshop (AVEC 2019)
DOI: 10.1145/3347320.3357694
Multimodal Fusion of BERT-CNN and Gated CNN Representations for Depression Detection

Abstract: Depression is a common but serious mental disorder that affects people all over the world. Besides providing an easier way of diagnosing the disorder, a computer-aided automatic depression assessment system is in demand to reduce subjective bias in the diagnosis. We propose a multimodal fusion of speech and linguistic representations for depression detection. We train our model to infer the Patient Health Questionnaire (PHQ) score of subjects from the AVEC 2019 DDS Challenge database, the E-DAIC corpus. Fo…
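To make the fusion idea concrete, the following is a minimal PyTorch sketch of concatenation-based fusion of a text embedding and a speech embedding into a PHQ-score regressor; all layer sizes and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LateFusionRegressor(nn.Module):
    """Concatenate per-modality embeddings and regress a single PHQ score."""
    def __init__(self, text_dim: int = 768, speech_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + speech_dim, hidden),  # joint projection of both modalities
            nn.ReLU(),
            nn.Linear(hidden, 1),                      # scalar PHQ-score prediction
        )

    def forward(self, text_emb: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
        return self.fusion(torch.cat([text_emb, speech_emb], dim=-1)).squeeze(-1)

# Example: a BERT-sized text embedding (768-d) fused with a hypothetical 128-d speech embedding
model = LateFusionRegressor()
scores = model(torch.randn(8, 768), torch.randn(8, 128))
print(scores.shape)  # torch.Size([8])
```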

Cited by 84 publications (47 citation statements) · References 32 publications
“…For instance, the authors in [ 13 ] used the bag-of-words model to encode audio and visual features and then fused them to perform multi-modal learning for depression detection. Rodrigues Makiuchi et al. [ 14 ] used texts generated from the original speech audio by Google Cloud’s speech recognition service, extracted hidden embeddings from a pretrained BERT [ 15 ] model, and concatenated all modalities, achieving a concordance correlation coefficient (CCC) score of 0.69 on the AVEC 2019 DDS Challenge dataset. Aside from audio, video, and text modalities, the method proposed in [ 16 ] employed body gestures as one of the modalities to perform early fusion.…”
Section: Related Work
confidence: 99%
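The concordance correlation coefficient (CCC) reported above combines correlation with agreement in mean and scale. A minimal NumPy sketch follows, with illustrative PHQ arrays that are not data from the paper:

```python
import numpy as np

def concordance_ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient (CCC) between two 1-D arrays."""
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()  # population variances
    cov = ((y_true - mean_true) * (y_pred - mean_pred)).mean()
    return 2.0 * cov / (var_true + var_pred + (mean_true - mean_pred) ** 2)

# Hypothetical reference and predicted PHQ scores
phq_ref = np.array([4.0, 10.0, 15.0, 7.0, 20.0])
phq_hat = np.array([5.0, 9.0, 14.0, 8.0, 18.0])
print(f"CCC = {concordance_ccc(phq_ref, phq_hat):.3f}")
```

Unlike Pearson correlation, CCC is penalized when predictions are systematically shifted or rescaled relative to the reference, which is why it is the standard metric for the AVEC regression challenges.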
“…Attempts to use external structures for this (such as visual indexes [31] or ontologies [32]) lead to significant losses in context, which in many cases diminishes the benefits of multimodal fusion. Therefore, approaches that use features preserving contextual domain dependencies dominate recent publications [18,19,21-23], with deep learning methods as the technological base.…”
Section: Background and Related Work
confidence: 99%
“…In semantic processing of medical texts, contextual word embeddings, primarily BERT [38], which consists of multiple transformer layers using the self-attention mechanism, show the best results [40,42,43]. For example, to fuse text and speech for depression detection, [19] extracted features with BERT-CNN and VGG-16 CNN models combined with a Gated Convolutional Neural Network (GCNN) followed by an LSTM layer. Additionally, [42] shows that BERT outperforms traditional word embedding methods in feature extraction tasks, and that BERT pre-trained on clinical texts performs better than BERT pre-trained on general-domain texts.…”
Section: Background and Related Work
confidence: 99%
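For readers unfamiliar with gated convolutions, below is a minimal PyTorch sketch of a gated CNN block followed by an LSTM, in the spirit of the GCNN-LSTM pipeline described above; the channel sizes, kernel width, and regression head are illustrative assumptions, not the configuration from [19].

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """One gated convolution (GLU) block: conv output elementwise-gated by a sigmoid branch."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, padding=pad)  # linear branch
        self.gate = nn.Conv1d(in_ch, out_ch, kernel, padding=pad)  # gating branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, time)
        return self.conv(x) * torch.sigmoid(self.gate(x))

class GCNNLSTM(nn.Module):
    """Gated CNN feature extractor followed by an LSTM, ending in a PHQ-score regressor."""
    def __init__(self, in_ch: int = 40, hidden: int = 128):
        super().__init__()
        self.gcnn = nn.Sequential(GatedConvBlock(in_ch, hidden), GatedConvBlock(hidden, hidden))
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # feats: (batch, time, in_ch)
        h = self.gcnn(feats.transpose(1, 2)).transpose(1, 2)  # back to (batch, time, hidden)
        _, (h_n, _) = self.lstm(h)
        return self.head(h_n[-1]).squeeze(-1)  # one score per sequence

# Example: 4 sequences of 200 frames with 40 acoustic features each
model = GCNNLSTM()
print(model(torch.randn(4, 200, 40)).shape)  # torch.Size([4])
```

The sigmoid branch acts as a learned gate that can suppress uninformative frames before the LSTM summarizes the sequence.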