“…We build our model on top of a pre-trained sequence-to-sequence architecture (i.e., BART; Lewis et al., 2020) fine-tuned on summarization and capable of generating fluent long text. We convert its textual encoder to a multimodal one by adding and tuning adapter layers (Rebuffi et al., 2017; Houlsby et al., 2019),

|                | Modality            | Input | Output | Datasets |
|----------------|---------------------|-------|--------|----------|
| text-to-text   | text                | short | short  | XSum (Narayan et al., 2018), CNN-DailyMail (Nallapati et al., 2016), NYT (Durrett et al., 2016), Gigaword (Napoles et al., 2012) |
| text-to-text   | text                | long  | long   | SamSum (Gliwa et al., 2019), QMSum (Zhong et al., 2021), SummScreen |
| video-to-video | vision              | short | short  | OVP (De Avila et al., 2011), YouTube (De Avila et al., 2011), SumMe (Gygli et al., 2014) |
| video-to-video | vision/text         | short | short  | TVSum (Song et al., 2015) |
| video-to-video | vision/text(/audio) | long  | long   | LoL (Fu et al., 2017), TRIPOD+ (Papalampidi et al., 2021b) |
| video-to-text  | vision              | long  | short  | TACoS (Rohrbach et al., 2014) |
| video-to-text  | vision/text/audio   | short | short  | How2 (Sanabria et al., 2018) |
| video-to-text  | vision/text/audio   | long  | long   | SummScreen 3D |

…”
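The adapter-based setup described in the excerpt can be illustrated with a short sketch. The code below is a minimal, hypothetical example, not the authors' implementation: it attaches bottleneck adapters (Houlsby et al., 2019) to the encoder layers of a summarization-fine-tuned BART checkpoint while keeping the pre-trained weights frozen. The checkpoint name (`facebook/bart-large-cnn`), the adapter bottleneck size, and the hook-based insertion point are assumptions made for illustration; the multimodal components the excerpt refers to are not shown.

```python
# Minimal sketch (assumptions noted above), not the paper's released code.
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration


class BottleneckAdapter(nn.Module):
    """Down-project -> non-linearity -> up-project, with a residual connection."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))


def add_encoder_adapters(model: BartForConditionalGeneration, bottleneck: int = 64):
    """Freeze the pre-trained model and attach one adapter per encoder layer."""
    for p in model.parameters():
        p.requires_grad = False

    adapters = nn.ModuleList()
    hidden_size = model.config.d_model
    for layer in model.model.encoder.layers:
        adapter = BottleneckAdapter(hidden_size, bottleneck)
        adapters.append(adapter)

        def hook(module, inputs, output, adapter=adapter):
            # Encoder layers typically return a tuple whose first element is
            # the hidden states; run them through the adapter.
            if isinstance(output, tuple):
                return (adapter(output[0]),) + output[1:]
            return adapter(output)

        layer.register_forward_hook(hook)

    return adapters  # the only trainable parameters


if __name__ == "__main__":
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
    adapters = add_encoder_adapters(model, bottleneck=64)
    trainable = sum(p.numel() for p in adapters.parameters())
    print(f"trainable adapter parameters: {trainable:,}")
```

Forward hooks are used here only to avoid modifying the library's layer classes; one could equally subclass the encoder layer or use a parameter-efficient fine-tuning library to the same effect.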