2021
DOI: 10.1007/978-3-030-71278-5_33

Self-supervised Disentanglement of Modality-Specific and Shared Factors Improves Multimodal Generative Models

Abstract: Multimodal generative models learn a joint distribution over multiple modalities and thus have the potential to learn richer representations than unimodal models. However, current approaches are either inefficient in dealing with more than two modalities or fail to capture both modality-specific and shared variations. We introduce a new multimodal generative model that integrates both modality-specific and shared factors and aggregates shared information across any subset of modalities efficiently. Our method …
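The abstract describes aggregating shared information across any subset of modalities while keeping modality-specific factors separate. Below is a minimal, hypothetical sketch of that general pattern, assuming a product-of-Gaussian-experts aggregation over the shared latent (one common way such subset-wise fusion is implemented); the module names, dimensions, and the standard-normal prior expert are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only: a multimodal VAE encoder with a shared latent that is
# aggregated across modalities via a product of Gaussian experts, plus a private
# (modality-specific) latent per modality. Names and sizes are assumptions.
import torch
import torch.nn as nn


def product_of_experts(mus, logvars):
    """Combine Gaussian posteriors q(z_shared | x_m) from any subset of modalities.

    Precisions add, so a missing modality is handled by simply omitting its expert.
    """
    precisions = [torch.exp(-lv) for lv in logvars]
    total_prec = torch.stack(precisions).sum(0) + 1.0  # +1.0 for a N(0, I) prior expert
    mu = torch.stack([m * p for m, p in zip(mus, precisions)]).sum(0) / total_prec
    return mu, -torch.log(total_prec)  # joint mean and log-variance


class ModalityEncoder(nn.Module):
    """Encodes one modality into a shared posterior and a private posterior."""

    def __init__(self, x_dim, z_shared=16, z_private=8, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.shared_head = nn.Linear(hidden, 2 * z_shared)    # mu, logvar
        self.private_head = nn.Linear(hidden, 2 * z_private)  # mu, logvar

    def forward(self, x):
        h = self.backbone(x)
        return self.shared_head(h).chunk(2, -1), self.private_head(h).chunk(2, -1)


def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)


# Toy usage: two modalities; only the shared factor is aggregated across them.
enc_a, enc_b = ModalityEncoder(32), ModalityEncoder(64)
x_a, x_b = torch.randn(4, 32), torch.randn(4, 64)
(s_mu_a, s_lv_a), (p_mu_a, p_lv_a) = enc_a(x_a)
(s_mu_b, s_lv_b), (p_mu_b, p_lv_b) = enc_b(x_b)
joint_mu, joint_lv = product_of_experts([s_mu_a, s_mu_b], [s_lv_a, s_lv_b])
z_shared = reparameterize(joint_mu, joint_lv)   # common input to all decoders
z_private_a = reparameterize(p_mu_a, p_lv_a)    # feeds only modality A's decoder
```

Because precisions simply add, the same aggregation function works for any subset of available modalities, which is the efficiency property the abstract refers to.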

Cited by 14 publications (23 citation statements)
References 13 publications (5 reference statements)
“…Furthermore, the stability of their learning remains a challenge; for instance, mode collapse is confirmed to be more severe in a multimodal setting [113]. Therefore, VAEs have become the mainstream in multimodal deep generative models, where GANs are sometimes used to improve the generation quality of multimodal VAEs [92,113] or to implement the divergence between distributions in VAEs [15]. Table 1…”
Section: Advantages of VAEs as Multimodal Generative Models
confidence: 99%
“…[Flattened table excerpt: a comparison of multimodal VAE models (MVAE [112], MMVAE [79], mmJSD [88], mmJSD (MS) [88], AVAE [118], MoPoE-VAE [89], PVAE [33], DMVAE [48], DMVAE [15], MFM [99], [78]) annotated with properties such as memory cost, the need for sub-sampling, mixture-of-experts (MoE) aggregation, and computational cost; the per-model check/cross entries are not recoverable.]…”
Section: Models
confidence: 99%
“…Similar to previous work, we have only considered models with simple priors, such as Gaussian and Laplace distributions with independent dimensions. Further, we have not considered models with modality-specific latent spaces, which seem to yield better empirical results (Hsu and Glass, 2018; Sutter et al., 2020; Daunhawer et al., 2020), but currently lack theoretical grounding. Modality-specific latent spaces offer a potential solution to the problem of cross-modal prediction by providing modality-specific context from the target modalities to each decoder.…”
Section: Discussion
confidence: 99%
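The statement above argues that modality-specific latent spaces help cross-modal prediction by giving each decoder modality-specific context. As a purely illustrative continuation of the earlier sketch, under the same assumption of separate shared and private latents, cross-modal generation could look like the following; the decoder layout and dimensions are hypothetical, not taken from any of the cited models.

```python
# Hypothetical cross-modal prediction: infer the shared factor from modality A
# alone, then sample modality B's private factor from its prior so that B's
# decoder still receives modality-specific context. The decoder is assumed to
# take the concatenation [z_shared, z_private_b]; this is an illustrative
# pattern, not a specific published architecture.
import torch
import torch.nn as nn

z_shared_dim, z_private_dim, x_b_dim = 16, 8, 64
decoder_b = nn.Sequential(
    nn.Linear(z_shared_dim + z_private_dim, 128), nn.ReLU(),
    nn.Linear(128, x_b_dim),
)

z_shared = torch.randn(4, z_shared_dim)        # inferred from modality A only
z_private_b = torch.randn(4, z_private_dim)    # drawn from B's prior N(0, I)
x_b_pred = decoder_b(torch.cat([z_shared, z_private_b], dim=-1))
```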