“…[99] VISIONE [31]
OpenCLIP ViT-L/14 trained with LAION-400M [60]: diveXplore [100]
OpenCLIP ViT-B/32 trained with LAION-2B [12], [60]: 4MR [34]
OpenCLIP ViT-B/32 XLM-RoBERTa base model trained with LAION-5B [13], [60]: vitrivr [96], vitrivr-VR [107]
CLIP [5], [86]: CVHunter [71], vitrivr [96], vitrivr-VR [107]
CLIP2Video [6], [45]: VISIONE [31]
BLIP [3], [66]: QIVISE [103]
CLIP4Clip [7], [77]: VIREO [79]
Custom cross-modal network [20], [46] combining multiple textual and visual features and employing OpenCLIP ViT-B/32 [60], [86], ResNet-152 [53], and ResNeXt-101 [80]: Verge [84]
ITV [116]: VIREO [79]
ALADIN [2], [81]: VISIONE [31]
Custom model [24], [105]: vitrivr [96], vitrivr-VR [107]

VBS systems have evolved considerably in recent years, offering innovative approaches for efficiently exploring and retrieving information from large video collections. Almost all of these systems exploit joint text-visual embeddings to improve the search experience and return more accurate results.…”
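As a rough illustration of how a joint text-visual embedding can drive retrieval, the sketch below ranks keyframes by cosine similarity to a text query. It assumes the query and the keyframes have already been encoded into the same vector space by some model of the kind listed above. The encoder itself is not shown, and the names `rank_by_cosine`, `frame_index`, and `query_embedding`, as well as the three-dimensional toy vectors, are hypothetical and chosen only for this example. It is not the implementation of any of the cited systems.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_by_cosine(query_vec, frame_vecs, top_k=3):
    """Return the top_k (frame_id, score) pairs, best match first."""
    q = l2_normalize(query_vec)
    scored = []
    for frame_id, vec in frame_vecs.items():
        v = l2_normalize(vec)
        score = sum(a * b for a, b in zip(q, v))
        scored.append((frame_id, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, made up for illustration; real models produce
# vectors with hundreds of dimensions.
frame_index = {
    "video01_frame0042": [0.9, 0.1, 0.0],
    "video07_frame0310": [0.1, 0.9, 0.2],
    "video12_frame0005": [0.7, 0.3, 0.1],
}

# Hypothetical embedding of a text query, assumed to come from the
# text encoder of the same joint model.
query_embedding = [1.0, 0.2, 0.0]

results = rank_by_cosine(query_embedding, frame_index, top_k=2)
for frame_id, score in results:
    print(f"{frame_id}: {score:.3f}")
```

Because text and images share one space, the same ranking function would serve a query-by-example search if `query_embedding` came from the image encoder instead. Large collections would normally replace the linear scan with an approximate nearest-neighbor index.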