“…As a fundamental task in vision-language understanding (Xu et al., 2021b; Park et al., 2022a; Miyawaki et al., 2022; Fang et al., 2023a,b; Kim et al., 2023; Jian and Wang, 2023), video-text retrieval (VTR) (Luo et al., 2022; Gao et al., 2021b; Ma et al., 2022a; Liu et al., 2022a; Zhao et al., 2022; Gorti et al., 2022; Fang et al., 2022) has attracted interest from both academia and industry. Although recent years have witnessed the rapid development of VTR, supported by powerful pretraining models (Luo et al., 2022; Gao et al., 2021b; Ma et al., 2022a; Liu et al., 2022a), improved retrieval methods (Bertasius et al., 2021; Dong et al., 2019), and the construction of video-language datasets (Xu et al., 2016), precisely matching video and language remains challenging because the raw data lie in heterogeneous spaces with significant differences.…”