2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00645
ViSiL: Fine-Grained Spatio-Temporal Video Similarity Learning

Abstract: In this paper we introduce ViSiL, a Video Similarity Learning architecture that considers fine-grained spatio-temporal relations between pairs of videos; such relations are typically lost in previous video retrieval approaches that embed the whole frame or even the whole video into a vector descriptor before the similarity estimation. By contrast, our Convolutional Neural Network (CNN)-based approach is trained to calculate video-to-video similarity from refined frame-to-frame similarity matrices, so as to con…

Cited by 52 publications (43 citation statements)
References 31 publications (73 reference statements)
“…In the inter-feature branch, ViSiL [24] is adapted to calculate the spatio-temporal relations between a pair of videos. The main approach of ViSiL is to estimate the pairwise frame similarity between videos by applying TensorDot and a mean-max-filter Chamfer Similarity (CS) on the region frame features.…”
Section: Inter-feature Branch
confidence: 99%
“…[Figure 2: ViSiL spatio-temporal similarity scores [24]] For the frame-to-frame similarity, given two video frames a, b, the region feature maps are extracted and decomposed into region vectors a_{i,j}, b_{k,l}. Then, CS is applied to calculate the similarity:…”
Section: Inter-feature Branch
confidence: 99%
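The mean-max Chamfer Similarity described in the excerpts above can be sketched as follows. This is an illustrative simplification, not the authors' implementation: function and variable names are hypothetical, and ViSiL additionally applies attention weighting and a refinement CNN that are omitted here.

```python
import numpy as np

def frame_to_frame_cs(a, b):
    """Chamfer Similarity (CS) between two frames' region feature maps.

    a, b: (N_regions, D) arrays of L2-normalized region vectors
    (a hypothetical sketch of the mean-max filter named in the quote).
    """
    # Pairwise dot products between all region vectors (the "TensorDot" step)
    sim = a @ b.T                      # shape (N_a, N_b)
    # Max over b's regions, then mean over a's regions (the mean-max filter)
    return sim.max(axis=1).mean()
```

With identical, orthonormal region vectors the similarity matrix is the identity and the CS evaluates to 1, its maximum for normalized inputs.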
“…A method named learning to align and match videos (LAMV) [19] is used for aligning videos temporally. A video similarity learning network named ViSiL [20] first computes frame-to-frame similarity and then video-to-video similarity, which avoids feature aggregation before the similarity calculation between videos. A method combining a CNN to extract frame features with a recurrent neural network (RNN) to retain temporal information is also proposed by [21], but the RNN is hard to train due to the excessive number of parameters needed.…”
Section: Introduction
confidence: 99%
“…The knowledge transfer capability of the pretrained CNN was evaluated on several audio recognition tasks and was found to generalize well, reaching human-level accuracy on environmental sound classification. Moreover, Kordopatis et al. [4] recently introduced ViSiL, a video similarity learning architecture that exploits spatio-temporal relations of the visual content to calculate the similarity between pairs of videos. It is a CNN-based approach trained to compute video-to-video similarity from frame-to-frame similarity matrices, considering intra- and inter-frame relations.…”
Section: Introduction
confidence: 99%
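The video-to-video step described in this excerpt can be sketched by applying the same mean-max reduction one level up, over a frame-to-frame similarity matrix. This is a hypothetical simplification: it uses one descriptor per frame, whereas ViSiL keeps region-level maps and refines the similarity matrix with a small CNN before this final reduction.

```python
import numpy as np

def chamfer_similarity(sim):
    """Mean over rows of the max over columns of a similarity matrix."""
    return sim.max(axis=1).mean()

def video_to_video_sim(frames_a, frames_b):
    """Video-level similarity from a frame-to-frame similarity matrix.

    frames_a: (T_a, D), frames_b: (T_b, D) -- one L2-normalized
    descriptor per frame (illustrative names; in ViSiL the matrix
    below is refined by a learned module before the reduction).
    """
    frame_sim = frames_a @ frames_b.T      # (T_a, T_b) frame-to-frame matrix
    return chamfer_similarity(frame_sim)
```

Because no aggregation happens before the final reduction, temporal structure in the frame-to-frame matrix survives until the last step, which is the property the excerpt highlights.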