Free-Form Multi-Modal Multimedia Retrieval (4MR)
Published: 2023
DOI: 10.1007/978-3-031-27077-2_58

Cited by 3 publications (4 citation statements)
References 14 publications
“…Model → System(s):
…[99] → VISIONE [31]
OpenCLIP ViT-L/14 trained with LAION-400m [60] → diveXplore [100]
OpenCLIP ViT-B/32 trained with LAION-2B [12], [60] → 4MR [34]
OpenCLIP ViT-B/32 xlm-roberta-base trained with LAION-5B [13], [60] → vitrivr [96], vitrivr-VR [107]
CLIP [5], [86] → CVHunter [71], vitrivr [96], vitrivr-VR [107]
CLIP2Video [6], [45] → VISIONE [31]
BLIP [3], [66] → QIVISE [103]
CLIP4Clip [7], [77] → VIREO [79]
Custom cross-modal network [20], [46] combining multiple textual and visual features and employing OpenCLIP ViT-B/32 [60], [86], ResNet-152 [53], and ResNeXt-101 [80] → Verge [84]
ITV [116] → VIREO [79]
ALADIN [2], [81] → VISIONE [31]
Custom model [24], [105] → vitrivr [96], vitrivr-VR [107]

The VBS systems have greatly evolved in recent years, offering innovative approaches to efficiently explore and retrieve information from large video collections. Almost all these systems exploit joint text-visual embeddings to enhance the search experience and provide more accurate results.…”
Section: Model System
Mentioning, confidence: 99%
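To make the joint text-visual embedding approach mentioned in this statement concrete, here is a minimal sketch, not taken from any of the cited systems: keyframes and a free-form text query are embedded into the same space with OpenCLIP, and keyframes are ranked by cosine similarity. The checkpoint tag, file paths, and query string are illustrative assumptions.

```python
# Minimal sketch (not from any cited system): embed video keyframes and a
# free-form text query into the same joint space with OpenCLIP, then rank
# keyframes by cosine similarity. Checkpoint tag, paths, and query are
# illustrative assumptions.
import glob

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"  # one of the LAION-2B checkpoints
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Offline: encode every keyframe once and keep the normalized matrix for search.
paths = sorted(glob.glob("keyframes/*.jpg"))  # placeholder location
with torch.no_grad():
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    image_emb = model.encode_image(batch)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# Online: encode the text query into the same space and rank by cosine similarity.
with torch.no_grad():
    text_emb = model.encode_text(tokenizer(["a red car driving in the rain"]))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = (text_emb @ image_emb.T).squeeze(0)
top = scores.argsort(descending=True)[:10]
print([(paths[int(i)], float(scores[i])) for i in top])
```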
“…This server extracts embeddings from a text query, compares them with an L2 distance to the visual embeddings of the keyframes, and returns the ranked results via a WebSocket connection to the frontend. 4MR [34] also uses a CLIP model, the ViT-B/32 [12], [60], [86] pretrained on LAION-2B. A Python server in the backend transforms the input to a vector, which is afterward used for similarity search.…”
Section: Model System
Mentioning, confidence: 99%
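To illustrate the query-time flow described in this statement, here is a minimal sketch under the assumption that keyframe embeddings were precomputed (for instance as in the previous sketch): the text query is turned into a vector and keyframes are ranked by L2 distance, smallest first. The function name, array shapes, and top_k value are hypothetical, not taken from 4MR or any other system.

```python
# Minimal sketch of the query-time ranking step: the backend turns the text
# query into a vector and ranks precomputed keyframe embeddings by L2
# distance. Names, shapes, and top_k are illustrative assumptions.
import numpy as np

def rank_keyframes_l2(query_vec: np.ndarray,
                      keyframe_embs: np.ndarray,
                      top_k: int = 20) -> list[tuple[int, float]]:
    """Return (keyframe_index, L2_distance) pairs for the top_k closest keyframes."""
    dists = np.linalg.norm(keyframe_embs - query_vec, axis=1)
    order = np.argsort(dists)[:top_k]
    return [(int(i), float(dists[i])) for i in order]

# Example with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
keyframe_embs = rng.normal(size=(10_000, 512)).astype(np.float32)  # one row per keyframe
query_vec = rng.normal(size=512).astype(np.float32)                # embedded text query
print(rank_keyframes_l2(query_vec, keyframe_embs)[:3])
```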