“…For multi-modal fusion modules, existing methods can be classified into two categories (i.e., single-stream and dual-stream). Specifically, single-stream models [8,27,28,43] use a single Transformer for early and unconstrained fusion between modalities, whereas dual-stream models [35,47,55] adopt a co-attention mechanism to enable interaction between the two modalities. For pretext tasks, inspired by uni-modal pre-training schemes such as MLM [10,33] and causal language modeling [6], existing studies explore a variety of pre-training tasks, including MLM [27,35,47], MIM [8,35], ITM [27,58], image-text contrastive learning [26], and prefix language modeling [51].…”
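To make the single-stream/dual-stream distinction concrete, the following is a minimal PyTorch sketch (not the architecture of any specific cited model): in the single-stream case, image and text tokens are concatenated and fused by one shared Transformer, while in the dual-stream case each modality cross-attends to the other. The module names, dimensions, and layer configuration are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SingleStreamFusion(nn.Module):
    """Single-stream: concatenate image and text tokens, fuse with one Transformer."""
    def __init__(self, dim=768, depth=6, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img_tokens, txt_tokens):
        # Early, unconstrained fusion: self-attention sees both modalities at once.
        fused = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.encoder(fused)

class DualStreamFusion(nn.Module):
    """Dual-stream: each modality attends to the other via co-attention."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # Co-attention: image tokens query text tokens, and vice versa.
        img_out, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
        txt_out, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
        return img_out, txt_out

# Usage with dummy token sequences (batch=2, 50 image patches, 20 text tokens).
img = torch.randn(2, 50, 768)
txt = torch.randn(2, 20, 768)
print(SingleStreamFusion()(img, txt).shape)   # torch.Size([2, 70, 768])
i_out, t_out = DualStreamFusion()(img, txt)
print(i_out.shape, t_out.shape)               # [2, 50, 768] [2, 20, 768]
```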
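Among the listed pretext tasks, the image-text contrastive objective can be illustrated with a short sketch of a symmetric InfoNCE loss over a batch of pooled image and text embeddings; the function name, temperature value, and embedding dimensions are assumptions for illustration, not the formulation of any specific cited work.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs are positives, all other pairs in the batch are negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Usage with dummy pooled embeddings (batch=8, dim=256).
loss = image_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```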