AWSD: Adaptive Weighted Spatiotemporal Distillation for Video Representation

Tavakolian, Mohammad; Tavakoli, Hamed R.; Hadid, Abdenour

doi:10.1109/iccv.2019.00811

“…A thorough review of the field is provided in Gou et al (2021). Knowledge Distillation has been employed on various computer vision problems, i.e., image classification (Yalniz et al, 2019;Touvron et al, 2020;Xie et al, 2020), object detection (Li et al, 2017;Shmelkov et al, 2017;Deng et al, 2019), metric learning (Park et al, 2019;Peng et al, 2019), action recognition (Garcia et al, 2018;Thoker & Gall, 2019;Stroud et al, 2020), video classification (Zhang & Peng, 2018;Bhardwaj et al, 2019), video captioning (Pan et al, 2020;Zhang et al, 2020), and representation learning (Tavakolian et al, 2019;Piergiovanni et al, 2020).…”

Section: Knowledge Distillation (citation type: mentioning)

confidence: 99%

DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval

Kordopatis-Zilos, Tzelepis, Papadopoulos et al. 2022

In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, called Distill-and-Select (DnS), that starting from a well-performing fine-grained Teacher Network learns: (a) Student Networks at different retrieval performance and computational efficiency trade-offs and (b) a Selector Network that at test time rapidly directs samples to the appropriate student to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store/index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets—this leads to good students. We evaluate DnS on five public datasets on three different video retrieval tasks and demonstrate (a) that our students achieve state-of-the-art performance in several cases and (b) that the DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, the proposed method achieves similar mAP with the teacher but is 20 times faster and requires 240 times less storage space. The collected dataset and implementation are publicly available: https://github.com/mever-team/distill-and-select.
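The routing idea in this abstract can be illustrated with a minimal sketch. It is not the DnS implementation: the function names (`route_with_selector`, `coarse_sim`, `fine_sim`, `selector_score`) and the `budget` parameter are hypothetical, and the sketch assumes a single cheap student, a single expensive student, and a selector that ranks samples by how much they would benefit from the expensive one.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def route_with_selector(
    items: Sequence[T],
    coarse_sim: Callable[[T], float],      # hypothetical cheap (coarse-grained) student
    fine_sim: Callable[[T], float],        # hypothetical costly (fine-grained) student
    selector_score: Callable[[T], float],  # hypothetical selector: higher = needs the fine student
    budget: float = 0.2,                   # assumed fraction of items sent to the fine student
) -> List[float]:
    """Score all items with the cheap student, then re-score the selected fraction."""
    sims = [coarse_sim(item) for item in items]
    n_fine = int(budget * len(items))
    # Rank item indices by selector score, highest first.
    ranked = sorted(range(len(items)), key=lambda i: selector_score(items[i]), reverse=True)
    for i in ranked[:n_fine]:
        sims[i] = fine_sim(items[i])
    return sims
```

In this sketch the `budget` value controls the trade-off: a larger fraction routed to the fine student costs more computation but replaces more coarse scores.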

“…A thorough review of the field is provided in (Gou et al, 2021). Knowledge Distillation has been employed on various computer vision problems, i.e., image classification (Yalniz et al, 2019;Touvron et al, 2020;Xie et al, 2020), object detection (Li et al, 2017;Shmelkov et al, 2017;Deng et al, 2019), metric learning (Park et al, 2019;Peng et al, 2019), action recognition (Garcia et al, 2018;Thoker and Gall, 2019;Stroud et al, 2020), video classification (Zhang and Peng, 2018;Bhardwaj et al, 2019), video captioning (Pan et al, 2020;Zhang et al, 2020), and representation learning (Tavakolian et al, 2019;Piergiovanni et al, 2020).…”

Section: Knowledge Distillation (citation type: mentioning)

confidence: 99%

DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval

Kordopatis-Zilos, Tzelepis, Papadopoulos et al. 2021

Preprint

In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, which we call Distill-and-Select (DnS), that starting from a well-performing fine-grained Teacher Network learns: a) Student Networks at different retrieval performance and computational efficiency trade-offs and b) a Selection Network that at test time rapidly directs samples to the appropriate student so as to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store

“…This work suggests that class probabilities, as "dark knowledge", are very useful to retain the performance of original network, and thus, light-weight substitute model could be trained to distill this knowledge. This approach is very useful and has been justified to solve a variety of complex application problems, such as pose estimation [37,46,33], lane detection [17], real-time streaming [31], object detection [6], video representation [41,10,11], and so forth. Furthermore, this approach is able to boost the performance of deep neural network with improvement on efficiency [35] and accuracy [25].…”

Section: Related Work (citation type: mentioning)

confidence: 99%

Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model

Wang, Li, Wang et al. 2020

Preprint

We study how to train a student deep neural network for visual recognition by distilling knowledge from a blackbox teacher model in a data-efficient manner. Progress on this problem can significantly reduce the dependence on large-scale datasets for learning high-performing visual recognition models. There are two major challenges. One is that the number of queries into the teacher model should be minimized to save computational and/or financial costs. The other is that the number of images used for the knowledge distillation should be small; otherwise, it violates our expectation of reducing the dependence on large-scale datasets. To tackle these challenges, we propose an approach that blends mixup and active learning. The former effectively augments the few unlabeled images by a big pool of synthetic images sampled from the convex hull of the original images, and the latter actively chooses from the pool hard examples for the student neural network and queries their labels from the teacher model. We validate our approach with extensive experiments.
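The combination of mixup and active learning described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: images are plain lists of floats, synthetic images are pairwise convex combinations (one simple way to sample from the convex hull), and "hard" is assumed to mean the student's lowest top-1 probability. All function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Image = List[float]


def mixup_pool(images: Sequence[Image], n_synthetic: int, rng: random.Random) -> List[Image]:
    """Build synthetic images as convex combinations of two randomly chosen originals."""
    pool = []
    for _ in range(n_synthetic):
        a, b = rng.sample(list(images), 2)
        lam = rng.random()  # mixing coefficient in [0, 1)
        pool.append([lam * x + (1.0 - lam) * y for x, y in zip(a, b)])
    return pool


def select_hard(pool: Sequence[Image], student_probs: Callable[[Image], Sequence[float]], k: int) -> List[Image]:
    """Assumed hardness criterion: keep the k images with the lowest student top-1 probability."""
    return sorted(pool, key=lambda img: max(student_probs(img)))[:k]


def active_mixup_round(
    images: Sequence[Image],
    student_probs: Callable[[Image], Sequence[float]],
    teacher_label: Callable[[Image], int],  # stands in for a blackbox teacher query
    n_synthetic: int = 100,
    k: int = 10,
    seed: int = 0,
) -> List[Tuple[Image, int]]:
    """One round: synthesize a pool, pick hard examples, query the teacher only for those."""
    rng = random.Random(seed)
    hard = select_hard(mixup_pool(images, n_synthetic, rng), student_probs, k)
    return [(img, teacher_label(img)) for img in hard]
```

Only the `k` selected images are sent to the teacher in each round, which is how a sketch like this would keep the number of teacher queries small.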

AWSD: Adaptive Weighted Spatiotemporal Distillation for Video Representation

Cited by 5 publications

References 32 publications

DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval

Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model