Free-Form Multi-Modal Multimedia Retrieval (4MR)
Published: 2023
DOI: 10.1007/978-3-031-27077-2_58

Cited by 3 publications (4 citation statements)
References 14 publications
“…Model → System(s):
…[99] → VISIONE [31]
OpenCLIP ViT-L/14 trained with LAION-400m [60] → diveXplore [100]
OpenCLIP ViT-B/32 trained with LAION-2B [12], [60] → 4MR [34]
OpenCLIP ViT-B/32 xlm-roberta-base trained with LAION-5B [13], [60] → vitrivr [96], vitrivr-VR [107]
CLIP [5], [86] → CVHunter [71], vitrivr [96], vitrivr-VR [107]
CLIP2Video [6], [45] → VISIONE [31]
BLIP [3], [66] → QIVISE [103]
CLIP4Clip [7], [77] → VIREO [79]
Custom cross-modal network [20], [46] combining multiple textual and visual features and employing OpenCLIP ViT-B/32 [60], [86], ResNet-152 [53], and ResNeXt-101 [80] → Verge [84]
ITV [116] → VIREO [79]
ALADIN [2], [81] → VISIONE [31]
Custom model [24], [105] → vitrivr [96], vitrivr-VR [107]

The VBS systems have greatly evolved in recent years, offering innovative approaches to efficiently explore and retrieve information from large video collections. Almost all these systems exploit joint text-visual embeddings to enhance the search experience and provide more accurate results.…”
Section: Model System
Mentioning, confidence: 99%
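To make the joint text-visual embedding approach mentioned in this statement concrete, here is a minimal sketch, not taken from any of the cited systems: keyframes and a free-form text query are embedded into the same space with OpenCLIP, and keyframes are ranked by cosine similarity. The checkpoint tag, file paths, and query string are illustrative assumptions.

```python
# Minimal sketch (not from any cited system): embed video keyframes and a
# free-form text query into the same joint space with OpenCLIP, then rank
# keyframes by cosine similarity. Checkpoint tag, paths, and query are
# illustrative assumptions.
import glob

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"  # one of the LAION-2B checkpoints
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Offline: encode every keyframe once and keep the normalized matrix for search.
paths = sorted(glob.glob("keyframes/*.jpg"))  # placeholder location
with torch.no_grad():
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    image_emb = model.encode_image(batch)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# Online: encode the text query into the same space and rank by cosine similarity.
with torch.no_grad():
    text_emb = model.encode_text(tokenizer(["a red car driving in the rain"]))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = (text_emb @ image_emb.T).squeeze(0)
top = scores.argsort(descending=True)[:10]
print([(paths[int(i)], float(scores[i])) for i in top])
```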
“…This server extracts embeddings from a text query, compares them with an L2 distance to the visual embeddings of the keyframes, and returns the ranked results via a WebSocket connection to the frontend. 4MR [34] also uses a CLIP model, the ViT-B/32 [12], [60], [86] pretrained on LAION-2B. A Python server in the backend transforms the input to a vector, which is afterward used for similarity search.…”
Section: Model System
Mentioning, confidence: 99%
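To illustrate the query-time flow described in this statement, here is a minimal sketch under the assumption that keyframe embeddings were precomputed (for instance as in the previous sketch): the text query is turned into a vector and keyframes are ranked by L2 distance, smallest first. The function name, array shapes, and top_k value are hypothetical, not taken from 4MR or any other system.

```python
# Minimal sketch of the query-time ranking step: the backend turns the text
# query into a vector and ranks precomputed keyframe embeddings by L2
# distance. Names, shapes, and top_k are illustrative assumptions.
import numpy as np

def rank_keyframes_l2(query_vec: np.ndarray,
                      keyframe_embs: np.ndarray,
                      top_k: int = 20) -> list[tuple[int, float]]:
    """Return (keyframe_index, L2_distance) pairs for the top_k closest keyframes."""
    dists = np.linalg.norm(keyframe_embs - query_vec, axis=1)
    order = np.argsort(dists)[:top_k]
    return [(int(i), float(dists[i])) for i in order]

# Example with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
keyframe_embs = rng.normal(size=(10_000, 512)).astype(np.float32)  # one row per keyframe
query_vec = rng.normal(size=512).astype(np.float32)                # embedded text query
print(rank_keyframes_l2(query_vec, keyframe_embs)[:3])
```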