“…It has various applications such as robotic navigation, video entertainment, and autonomous driving, to name a few [1,2,3,4,5]. Although much progress has been achieved in recent years [6,7,8,9,10,11,12,13], VMR remains difficult due to the challenging nature of videos and texts, including complex temporal relations, fine-grained semantic structures, and a large cross-modal gap between visual and textual features [11,14,15,16].…”

*Shucheng Huang (schuang@just.edu.cn) is the corresponding author.