2022
DOI: 10.1007/978-3-030-98355-0_52
VISIONE at Video Browser Showdown 2022

Abstract: VISIONE is a content-based retrieval system that supports various search functionalities (text search, object/color-based search, semantic and visual similarity search, temporal search). It uses a full-text search engine as a search backend. In the latest version of our system, we modified the user interface, and we made some changes to the techniques used to analyze and search for videos.

Cited by 10 publications (10 citation statements)
References 24 publications (32 reference statements)
“…Specifically, it implements object queries by placing the desired objects or colors in a canvas, it allows video searching by specifying natural language descriptions of desired keyframes or shots, and it supports temporal queries for finding consecutive specific events.…”
Footnotes: [2] https://lucene.apache.org/ [3] https://github.com/facebookresearch/faiss [4] We leave the investigation of a STR technique that is suitable for indexing this type of dense vector to future work.
Section: Discussion (mentioning, confidence: 99%)
“…Therefore, for the CLIP2Video features, the approximated cosine similarity computed in the STR representation badly approximates the original one. For these reasons, for the CLIP-based features, we instead relied on the FAISS index, using an exact search and an 8-bit scalar quantization to reduce the index size in memory [4]. Despite the exact search, with the in-memory quantized index, the search over the full V3C1 + V3C2 shots takes only a few milliseconds, at the cost of much larger memory utilization.…”
Section: Indexing (mentioning, confidence: 99%)
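The indexing choice quoted above (exact brute-force search over 8-bit scalar-quantized vectors, with unit-norm vectors so that inner product equals cosine similarity) can be sketched with a small NumPy stand-in for a FAISS scalar quantizer. The per-dimension min/max training rule and the toy data are illustrative assumptions, not VISIONE's actual FAISS configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 64
xb = rng.standard_normal((n, d)).astype(np.float32)
xb /= np.linalg.norm(xb, axis=1, keepdims=True)   # unit norm: inner product == cosine

# "Train" the scalar quantizer: per-dimension min/max, 256 levels (8 bits)
lo, hi = xb.min(axis=0), xb.max(axis=0)
scale = (hi - lo) / 255.0
codes = np.round((xb - lo) / scale).astype(np.uint8)  # d bytes/vector vs 4*d for floats

def decode(c):
    """Reconstruct approximate float vectors from the 8-bit codes."""
    return c.astype(np.float32) * scale + lo

# Exact search: brute-force scan of all decoded vectors for each query
xq = xb[:5]                        # queries that are already in the index
sims = decode(codes) @ xq.T        # (n, 5) approximate cosine similarities
nearest = sims.argmax(axis=0)
print(nearest)                     # each query retrieves itself: [0 1 2 3 4]
```

FAISS wraps the same idea in `IndexScalarQuantizer`, which also scans the codes exhaustively; the trade-off the citing paper describes is that this exact scan stays fast (milliseconds over V3C1 + V3C2) only because the whole quantized index is kept in RAM.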
“…Model → System(s):

…[99] → VISIONE [31]
OpenCLIP ViT-L/14 trained with LAION-400m [60] → diveXplore [100]
OpenCLIP ViT-B/32 trained with LAION-2B [12], [60] → 4MR [34]
OpenCLIP ViT-B/32 xlm-roberta-base trained with LAION-5B [13], [60] → vitrivr [96], vitrivr-VR [107]
CLIP [5], [86] → CVHunter [71], vitrivr [96], vitrivr-VR [107]
CLIP2Video [6], [45] → VISIONE [31]
BLIP [3], [66] → QIVISE [103]
CLIP4Clip [7], [77] → VIREO [79]
Custom cross-modal network [20], [46] combining multiple textual and visual features and employing OpenCLIP ViT-B/32 [60], [86], ResNet-152 [53], and ResNeXt-101 [80] → Verge [84]
ITV [116] → VIREO [79]
ALADIN [2], [81] → VISIONE [31]
Custom model [24], [105] → vitrivr [96], vitrivr-VR [107]

The VBS systems have greatly evolved in recent years, offering innovative approaches to efficiently explore and retrieve information from large video collections. Almost all of these systems exploit joint text-visual embeddings to enhance the search experience and provide more accurate results.…”
Section: Model/System (mentioning, confidence: 99%)
“…STR-based methods, on the other hand, rely on transformations that sparsify data and encode it as small sets of codewords indexed on standard text engines [9,2,4]. These approaches are successfully used to solve multimodal queries for combined text search with image similarity [1,3].…”
Section: Introduction (mentioning, confidence: 99%)
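The STR idea described in the passage above (sparsify a dense vector and encode it as a small set of codewords that a standard text engine can index) can be sketched as follows. The codeword scheme (`p{i}`/`n{i}` for signed dimensions), the magnitude-proportional repetition, and the `Counter`-based scoring standing in for a full-text engine like Lucene are illustrative assumptions, not the exact encoding used by the cited systems:

```python
import numpy as np
from collections import Counter

def surrogate_text(v, k=16, levels=10):
    """Sparsify a dense vector into codewords: keep the k largest components
    and repeat each codeword proportionally to its magnitude, so that a text
    engine's term-frequency scoring approximates the original dot product."""
    top = np.argsort(-np.abs(v))[:k]
    vmax = np.abs(v).max()
    terms = []
    for i in top:
        reps = max(1, int(round(levels * abs(v[i]) / vmax)))
        terms += [("p" if v[i] >= 0 else "n") + str(i)] * reps
    return " ".join(terms)

rng = np.random.default_rng(1)
docs = rng.standard_normal((100, 64)).astype(np.float32)

# "Index": term-frequency bags, standing in for an inverted index in a text engine
index = [Counter(surrogate_text(v).split()) for v in docs]

# Query with a slightly perturbed copy of document 7
q = docs[7] + 0.02 * rng.standard_normal(64).astype(np.float32)
q_bag = Counter(surrogate_text(q).split())
scores = [sum(tf * q_bag.get(t, 0) for t, tf in bag.items()) for bag in index]
print(int(np.argmax(scores)))   # document 7 scores highest
```

Because the encoded documents are plain strings, an off-the-shelf engine can store and score them with its usual term-frequency machinery, which is what lets these methods serve multimodal queries (text plus image similarity) from a single text index.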