- Learn shared representations from weighted modality-specific representations: Gated Multimodal Unit (GMU) [429]; parallel attention model; attention layer; sparse MLP (mixes vertical and horizontal information via weight sharing and sparse connections); multimodal encoder-decoder; multimodal factorized bilinear pooling (combines the compact output features of multimodal low-rank bilinear pooling [430] with the robustness of multimodal compact bilinear pooling [431]); multi-head intermodal attention fusion; transformer [295]; feed-forward network; low-rank multimodal fusion network [432]. Used in [62, 65, 67, 76, 93, 100, 102, 106, 113, 117, 131, 135, 136, 142-144, 174, 218, 433].
- Learn joint sparse representations: dictionary learning [20].
- Learn and fuse outputs from different modality-specific parts at fixed time steps: cell-coupled LSTM with L-skip fusion mechanism [101].
- Learn cross-modality representations that incorporate interactions between modalities: LXMERT [434]; transformer encoder with cross-attention layers (representations of one modality serve as queries and those of the other as keys/values, and vice versa); memory fusion network [435]. Used in [82, 92, 129].
- Horizontal and vertical kernels to capture patterns across different levels: CASER [309]. Used in [170].
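To make the first strategy concrete, the gating idea behind the GMU [429] can be sketched as follows. This is a minimal illustration, not the reference implementation: the modality names (`x_v`, `x_t`), dimensions, and weight variables are hypothetical, following the commonly used formulation in which a learned gate `z` forms a convex, per-dimension combination of the two modality-specific candidate representations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gmu(x_v, x_t, W_v, W_t, W_z):
    """Gated Multimodal Unit (sketch): the gate z weighs, per output
    dimension, how much each modality contributes to the shared
    representation. Variable names here are illustrative."""
    h_v = np.tanh(W_v @ x_v)                              # candidate from modality 1 (e.g. visual)
    h_t = np.tanh(W_t @ x_t)                              # candidate from modality 2 (e.g. textual)
    z = sigmoid(W_z @ np.concatenate([x_v, x_t]))         # gate in (0, 1), conditioned on both inputs
    return z * h_v + (1.0 - z) * h_t                      # convex combination of the candidates

# Toy usage with illustrative dimensions (3-d and 5-d inputs, 4-d shared space).
rng = np.random.default_rng(0)
x_v = rng.standard_normal(3)
x_t = rng.standard_normal(5)
W_v = rng.standard_normal((4, 3))
W_t = rng.standard_normal((4, 5))
W_z = rng.standard_normal((4, 8))
h = gmu(x_v, x_t, W_v, W_t, W_z)                          # shared 4-d representation
```

Because the output is a convex combination of two `tanh` activations, every component of `h` stays within (-1, 1); the gate lets the model lean on whichever modality is more informative for each dimension.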