Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Zhang, Chao; Yang, Zichao; He, Xiaodong; Deng, Li

doi:10.1109/jstsp.2020.2987728

Cited by 230 publications

(79 citation statements)

References 199 publications

(203 reference statements)

“…These diverse modalities differ in their scales, representation format, varied predictive power, weights, and contributions towards the final task [9]. Optimal data fusion schemes such as early [11], late [48], and hybrid fusion [49] schemes are developed to fuse the modalities at data, feature, decision, and intermediate mixed levels [50]. Deep neural nets [51], kernel-based methods [52], and graphical models [47,48] are employed for analysis and handling such data depending on the downstream task [46].…”

Section: Multimodal Machine Learning (mentioning)

confidence: 99%

A Review on Explainability in Multimodal Deep Neural Nets

2021

Artificial Intelligence techniques powered by deep neural nets have achieved much success in several application domains, most significantly and notably in the Computer Vision applications and Natural Language Processing tasks. Surpassing human-level performance propelled the research in the applications where different modalities amongst language, vision, sensory, text play an important role in accurate predictions and identification. Several multimodal fusion methods employing deep learning models are proposed in the literature. Despite their outstanding performance, the complex, opaque and black-box nature of the deep neural nets limits their social acceptance and usability. This has given rise to the quest for model interpretability and explainability, more so in the complex tasks involving multimodal AI methods. This paper extensively reviews the present literature to present a comprehensive survey and commentary on the explainability in multimodal deep neural nets, especially for the vision and language tasks. Several topics on multimodal AI and its applications for generic domains have been covered in this paper, including the significance, datasets, fundamental building blocks of the methods and techniques, challenges, applications, and future trends in this domain.

Index terms: deep multimodal learning, explainable AI, interpretability, survey, trends, vision and language research, XAI.

“…Multimodal Machine Learning: There is a long history of research in this area, exploring different directions [30, 31, 32]. Representation learning [33, 34, 35] is one of such directions in which effective and robust joint features are learned, typically from large-scale data sets, to be used in general downstream tasks, such as visual question answering or visual commonsense reasoning.…”

Section: Related Work (mentioning)

confidence: 99%

Multimodal Classification of Parkinson’s Disease in Home Environments with Resiliency to Missing Modalities

Heidarivincheh, McConville, Morgan, et al. 2021

Sensors

Parkinson’s disease (PD) is a chronic neurodegenerative condition that affects a patient’s everyday life. Authors have proposed that a machine learning and sensor-based approach that continuously monitors patients in naturalistic settings can provide constant evaluation of PD and objectively analyse its progression. In this paper, we make progress toward such PD evaluation by presenting a multimodal deep learning approach for discriminating between people with PD and without PD. Specifically, our proposed architecture, named MCPD-Net, uses two data modalities, acquired from vision and accelerometer sensors in a home environment to train variational autoencoder (VAE) models. These are modality-specific VAEs that predict effective representations of human movements to be fused and given to a classification module. During our end-to-end training, we minimise the difference between the latent spaces corresponding to the two data modalities. This makes our method capable of dealing with missing modalities during inference. We show that our proposed multimodal method outperforms unimodal and other multimodal approaches by an average increase in F1-score of 0.25 and 0.09, respectively, on a data set with real patients. We also show that our method still outperforms other approaches by an average increase in F1-score of 0.17 when a modality is missing during inference, demonstrating the benefit of training on multiple modalities.

“…Despite the growing amount of works on LSM, most of these methods have neglected the importance of utilizing the learned representation from well trained model, which causes them fail to transfer the learned knowledge to other datasets. Learning good representation from the input data is a core problem for unsupervised learning [12]. Although some recent studies [13] and [14] have shown good performance on extracting representation with impressive properties, there still exist some key problems needed to be fulfilled.…”

Section: A Objectives (mentioning)

confidence: 99%

Unsupervised Feature Learning to Improve Transferability of Landslide Susceptibility Representations

Zhu, Chen, Han, et al. 2020

IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing

A landslide susceptibility map (LSM) is of vital importance for risk recognition and prevention. In the last decade, statistical methods have increasingly been used to map landslide susceptibility and locate high-risk places. However, because full access to the thematic information is hard to obtain in large scenarios, most of these statistical methods suffer from overfitting, inadequate representative power, and the inability to transfer the learned representation to other places. To address these challenges, this study designed an unsupervised representation learning module, which features independence, compactness, robustness, and transferability. Specifically, we first stack restricted Boltzmann machines (RBMs) and a denoising autoencoder (DAE) to discover, in an unsupervised manner, the underlying representations embedded in the thematic maps. Then we apply a transfer strategy in an adversarial manner to generalize the learned representations to the sample-scarce area. Experimental results and analyses using data from different regions show that the proposed method generalizes well between different LSM scenarios. In terms of precision, it outperforms other methods by a large margin, e.g. by around 7% compared to multilayer perceptrons (MLP) with the same configuration, and by 3%-4% compared to the state-of-the-art random forest (RF) algorithm. Besides, compared to other methods, the landslide susceptibility map predicted by the proposed method is smoother and more stable, seems more reliable, and is more consistent with prior knowledge that, for example, distance to the drainage, slope, and stratum should exert dominant effects on the occurrence of a landslide.
