“…The multi-modal retrieval task has been studied across various modality pairs, such as image-text (Zhang et al, 2020; Cheng et al, 2022; Luo et al, 2022; Xuan and Chen, 2023), video-text (Gorti et al, 2022), audio-image (Xu, 2020; Nakatsuka et al, 2023), video-audio (Surís et al, 2018; Gu et al, 2023), and audio-text (Kim et al, 2022; Xin et al, 2023). In particular, CLIP4CLIP (Luo et al, 2022) performs well in the video-text retrieval task by calculating the similarities between the features of each modality obtained from the encoder, and X-CLIP extends CLIP4CLIP with a multi-grained regulation function to improve performance.…”
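
The similarity-based retrieval described in the excerpt can be illustrated with a minimal sketch. It assumes that each modality has already been encoded into fixed-size feature vectors, that frame features are mean-pooled into one video-level vector, and that cosine similarity is the scoring function. These choices, the helper names (`mean_pool`, `cosine_similarity`, `rank_videos`), and the toy feature values are illustrative assumptions and are not taken from the cited works.

```python
import math


def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(text_feature, videos):
    """Return video indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_feature, mean_pool(v)) for v in videos]
    return sorted(range(len(videos)), key=lambda i: scores[i], reverse=True)


# Toy features standing in for encoder outputs (hypothetical values).
text_feature = [1.0, 0.0]
videos = [
    [[0.0, 1.0], [0.0, 1.0]],  # video 0: orthogonal to the text feature
    [[1.0, 0.0], [1.0, 0.2]],  # video 1: nearly aligned with the text feature
]

print(rank_videos(text_feature, videos))  # prints [1, 0]
```

In this toy setting, video 1 is ranked first because its pooled feature points in nearly the same direction as the text feature. A real system would replace the hand-written vectors with the outputs of the text and video encoders.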