Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers

Kordopatis-Zilos, Giorgos; Papadopoulos, Symeon; Patras, Ioannis; Kompatsiaris, Ioannis

doi:10.1007/978-3-319-51811-4_21

Cited by 60 publications

(44 citation statements)

References 19 publications


“…The proposed unsupervised NDVR approach relies on a Bag-of-Words (BoW) scheme [27]. In particular, two aggregation variations are proposed: a vector aggregation where a single codebook of visual words is used, and a layer aggregation where multiple codebooks of visual words are used.…”

Section: Bag-of-words Approach (mentioning)

confidence: 99%
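The two aggregation variations named in the statement above can be illustrated with a minimal sketch. It assumes that each frame is already described by one vector per CNN layer and that the codebooks are given. The function names, the squared Euclidean nearest-word assignment, and the plain word counts (with no further weighting) are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def nearest_word(vec, codebook):
    """Index of the codeword closest to vec (squared Euclidean distance, an assumption)."""
    return min(range(len(codebook)),
               key=lambda i: sum((x - c) ** 2 for x, c in zip(vec, codebook[i])))

def vector_aggregation(frames, codebook):
    """Single codebook: concatenate a frame's layer vectors, map to one word per frame.

    frames: list of frames, each a list of per-layer vectors (hypothetical layout).
    """
    words = [nearest_word([x for layer in frame for x in layer], codebook)
             for frame in frames]
    return Counter(words)

def layer_aggregation(frames, codebooks):
    """One codebook per layer: each frame contributes one word per layer."""
    hist = Counter()
    for frame in frames:
        for layer_idx, layer_vec in enumerate(frame):
            hist[(layer_idx, nearest_word(layer_vec, codebooks[layer_idx]))] += 1
    return hist

# Toy example: two frames, two layers, one-dimensional layer vectors.
frames = [[[0.0], [1.0]], [[1.0], [1.0]]]
video_hist_single = vector_aggregation(frames, [[0.0, 1.0], [1.0, 1.0]])
video_hist_layers = layer_aggregation(frames, [[[0.0], [1.0]], [[0.0], [1.0]]])
```

In this sketch a video becomes a histogram of visual words in both cases; the difference is only whether a frame yields one word from a shared codebook or one word per layer.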

“…In Section 4.2, we review the related literature in the field of NDVR by providing an outline of the major trends in the field. In Section 4.3, we present the two aforementioned NDVR approaches that have been developed within the InVID project [27,28]. In Section 4.4, we report on the results of a comprehensive experimental study, including a comparison with five state-of-the-art methods.…”

Section: Introduction (mentioning)

confidence: 99%


Finding Near-Duplicate Videos in Large-Scale Collections

Kordopatis-Zilos, Papadopoulos, Patras et al. 2019

Video Verification in the Fake News Era

Self Cite


This chapter discusses the problem of Near-Duplicate Video Retrieval (NDVR). The main objective of a typical NDVR approach is: given a query video, retrieve all near-duplicate videos in a video repository and rank them based on their similarity to the query. Several approaches have been introduced in the literature, which can be roughly classified into three categories based on the level of video matching, i.e. (i) video-level, (ii) frame-level and (iii) filter-and-refine matching. Two methods based on video-level matching are presented in this chapter. The first is an unsupervised scheme that relies on a modified Bag-of-Words (BoW) video representation. The second is a supervised method based on Deep Metric Learning (DML). For the development of both methods, features are extracted from the intermediate layers of Convolutional Neural Networks and leveraged as frame descriptors, since they offer a compact and informative image representation, and lead to increased system efficiency. Extensive evaluation has been conducted on publicly available benchmark datasets, and the presented methods are compared with state-of-the-art approaches, achieving the best results in all evaluation setups.


“…However, the results show that the best performance is achieved when combining the deep feature descriptor with a global descriptor using Scalable Compressed Fisher Vectors (SCFV) [20]. Recently, an approach for using features from intermediate CNN layers for near-duplicate video retrieval has been proposed [21], showing that the additionally preserved structural information improves matching performance.…”

Section: Related Work (mentioning)

confidence: 99%

Temporal Compression and Fast Matching of Hand-Crafted and Deep Features of Video Segments

Bailer, Wechtitsch 2018

2018 International Conference on Content-Based Multimedia Indexing (CBMI)


In order to enable efficient instance search in video, compact descriptors for video segments have been proposed. They exploit the temporal redundancy within a video segment to obtain smaller descriptors, and the segment-based representation can be exploited to enable more efficient matching. In this paper we analyze the performance of different visual features when applying both lossless and lossy compression to the set of descriptors of one video segment. We consider both handcrafted and deep features, i.e., visual features obtained from training a deep convolutional neural network. We also propose optimizations to the extraction and matching procedure. Both the compression methods and the optimizations are experimentally evaluated on a large video data set.


“…This leads to specialized solutions that typically exhibit poor performance when used (without tuning) on different video corpora. For instance, some methods learn codebooks [24,1,4,14] or hashing functions [25,26,7] based on sample frames from the evaluation dataset, and as a result their reported retrieval performance is often exaggerated.…”

Section: Introduction (mentioning)

confidence: 99%

“…Motivated by the excellent performance of deep learning in a wide variety of multimedia problems, we are proposing a video-level NDVR approach that incorporates deep learning in two steps. First, we use CNN features from intermediate convolution layers based on a well-known scheme called Maximum Activation of Convolutions [22,34,21], which was recently used for NDVR and led to improved results [14]. Second, we leverage a Deep Metric Learning (DML) framework based on a triplet-wise scheme, which has been shown to be effective in a variety of cases [2,30,29].…”

Section: Introduction (mentioning)

confidence: 99%
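The two steps named in the statement above, Maximum Activation of Convolutions (MAC) and a triplet-wise loss, can be sketched as follows. This is a minimal illustration under stated assumptions: feature maps are nested lists of shape channels x height x width, each layer vector is L2-normalized before concatenation, and the loss is a hinge over squared Euclidean distances with a hypothetical margin of 1.0. None of these choices is confirmed by the quoted text.

```python
import math

def mac(feature_maps):
    """Max-pool every channel over its spatial positions (one value per channel)."""
    return [max(max(row) for row in channel) for channel in feature_maps]

def l2_normalize(vec):
    """Scale vec to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def frame_descriptor(layers):
    """Concatenate the normalized MAC vectors of several layers, then renormalize.

    The per-layer normalization is an assumption made for this sketch.
    """
    parts = []
    for feature_maps in layers:
        parts.extend(l2_normalize(mac(feature_maps)))
    return l2_normalize(parts)

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: zero once the negative is farther than the positive by the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

# Toy layer with two channels of 2x2 activations.
toy_layer = [[[1.0, 2.0], [3.0, 0.0]], [[-1.0, -2.0], [-3.0, -4.0]]]
toy_mac = mac(toy_layer)
toy_descriptor = frame_descriptor([toy_layer, toy_layer])
```

The loss pushes a near-duplicate (positive) closer to the query (anchor) than a dissimilar video (negative); triplets that already satisfy the margin contribute nothing.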

Near-Duplicate Video Retrieval with Deep Metric Learning

Kordopatis-Zilos, Papadopoulos, Patras et al. 2017

2017 IEEE International Conference on Computer Vision Workshops (ICCVW)

Self Cite


This work addresses the problem of Near-Duplicate Video Retrieval (NDVR). We propose an effective video-level NDVR scheme based on deep metric learning that leverages Convolutional Neural Network (CNN) features from intermediate layers to generate discriminative global video representations in tandem with a Deep Metric Learning (DML) framework with two fusion variations, trained to approximate an embedding function for accurate distance calculation between two near-duplicate videos. In contrast to most state-of-the-art methods, which exploit information deriving from the same source of data for both development and evaluation (which usually results in dataset-specific solutions), the proposed model is fed during training with sampled triplets generated from an independent dataset and is thoroughly tested on the widely used CC_WEB_VIDEO dataset, using two popular deep CNN architectures (AlexNet, GoogLeNet). We demonstrate that the proposed approach achieves outstanding performance against the state-of-the-art, either with or without access to the evaluation dataset.
