2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00645
ViSiL: Fine-Grained Spatio-Temporal Video Similarity Learning

Abstract: In this paper we introduce ViSiL, a Video Similarity Learning architecture that considers fine-grained spatio-temporal relations between pairs of videos; such relations are typically lost in previous video retrieval approaches that embed the whole frame or even the whole video into a vector descriptor before the similarity estimation. By contrast, our Convolutional Neural Network (CNN)-based approach is trained to calculate video-to-video similarity from refined frame-to-frame similarity matrices, so as to con…

Cited by 52 publications (43 citation statements)
References 31 publications (73 reference statements)
“…In the inter-feature branch, ViSiL [24] is adapted to calculate the spatio-temporal relations between a pair of videos. The main approach of ViSiL is to estimate the pairwise frame similarity between videos by applying TensorDot and a mean-max-filter Chamfer Similarity (CS) on the region frame features.…”
Section: Inter-feature Branch
confidence: 99%
“…[Figure 2: ViSiL spatio-temporal similarity scores [24]] For the frame-to-frame similarity, given two video frames a, b, the region feature maps are extracted and decomposed into region vectors a_{i,j}, b_{k,l}. Then, CS is applied to calculate the similarity:…”
Section: Inter-feature Branch
confidence: 99%
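The mean-max Chamfer Similarity described in the excerpts above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: function and variable names are hypothetical, and ViSiL additionally applies attention weighting and a refinement CNN that are omitted here.

```python
import numpy as np

def frame_to_frame_cs(a, b):
    """Chamfer Similarity (CS) between two frames' region feature maps.

    a, b: (N_regions, D) arrays of L2-normalized region vectors
    (a hypothetical sketch of the mean-max filter named in the quote).
    """
    # Pairwise dot products between all region vectors (the "TensorDot" step)
    sim = a @ b.T                      # shape (N_a, N_b)
    # Max over b's regions, then mean over a's regions (the mean-max filter)
    return sim.max(axis=1).mean()
```

With identical, orthonormal region vectors the similarity matrix is the identity and the CS evaluates to 1, its maximum for normalized inputs.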
“…A method named learning to align and match videos (LAMV) [19] is used for aligning videos temporally. A video similarity learning network named ViSiL [20] first computes frame-to-frame similarity and then video-to-video similarity, which avoids feature aggregation before the similarity calculation between videos. A method combining a CNN to extract frame features with a recurrent neural network (RNN) to retain temporal information is also proposed by [21], but the RNN is hard to train due to the excessive number of parameters needed.…”
Section: Introduction
confidence: 99%
“…The knowledge transfer capability of the pretrained CNN was evaluated on several audio recognition tasks and was found to generalize well, reaching human-level accuracy on environmental sound classification. Moreover, Kordopatis et al. [4] recently introduced ViSiL, a video similarity learning architecture that exploits spatio-temporal relations of the visual content to calculate the similarity between pairs of videos. It is a CNN-based approach trained to compute video-to-video similarity from frame-to-frame similarity matrices, considering intra- and inter-frame relations.…”
Section: Introduction
confidence: 99%
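The video-to-video step described in this excerpt can be sketched by applying the same mean-max reduction one level up, over a frame-to-frame similarity matrix. This is a hypothetical simplification: it uses one descriptor per frame, whereas ViSiL keeps region-level maps and refines the similarity matrix with a small CNN before this final reduction.

```python
import numpy as np

def chamfer_similarity(sim):
    """Mean over rows of the max over columns of a similarity matrix."""
    return sim.max(axis=1).mean()

def video_to_video_sim(frames_a, frames_b):
    """Video-level similarity from a frame-to-frame similarity matrix.

    frames_a: (T_a, D), frames_b: (T_b, D) -- one L2-normalized
    descriptor per frame (illustrative names; in ViSiL the matrix
    below is refined by a learned module before the reduction).
    """
    frame_sim = frames_a @ frames_b.T      # (T_a, T_b) frame-to-frame matrix
    return chamfer_similarity(frame_sim)
```

Because no aggregation happens before the final reduction, temporal structure in the frame-to-frame matrix survives until the last step, which is the property the excerpt highlights.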