“…dle the text input. Multimodal summarization aims to condense information from multimodal inputs, such as text, vision, and audio [12]. Recently, this task has been studied extensively [10, 2, 8, 31]. Many of these works focus on fusing visual information to improve the quality of text summaries [11].…”