“…Self-supervision has become the new norm for learning representations given its ability to exploit unlabelled data [59,23,15,2,5,81,4,9,60,39,14]. Recent approaches devised for video understanding can be divided into two categories based on the SSL objective, namely pretext task based and contrastive learning based.…”
Section: SSL for Video Representation Learning
The task of Generic Event Boundary Detection (GEBD) aims to detect moments in videos that humans naturally perceive as generic, taxonomy-free event boundaries. Modeling the dynamically evolving temporal and spatial changes in a video makes GEBD a difficult problem to solve. Existing approaches rely on complex and sophisticated pipelines in terms of architectural design choices, creating a need for more straightforward and simplified approaches. In this work, we address this issue by revisiting a simple and effective self-supervised method and augmenting it with a differentiable motion feature learning module to tackle the spatial and temporal diversity of the GEBD task. We perform extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets to demonstrate the efficacy of the proposed approach compared to other self-supervised state-of-the-art methods. We also show that this simple self-supervised approach learns motion features without any explicit motion-specific pretext task. Our results can be reproduced on GitHub.
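As a concrete illustration of boundary detection on learned features, the sketch below flags frames whose embeddings change sharply relative to their neighbours. The cosine-similarity rule, the threshold, and the function name are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def detect_event_boundaries(frame_features: np.ndarray, threshold: float = 0.5):
    """Flag frames whose features differ sharply from the previous frame.

    frame_features: (T, D) array of per-frame embeddings, e.g. from a
    self-supervised encoder. Thresholded cosine similarity is a simplified
    stand-in for the learned boundary scoring used in real GEBD systems.
    """
    # L2-normalise each frame embedding.
    normed = frame_features / np.linalg.norm(frame_features, axis=1, keepdims=True)
    # Cosine similarity between consecutive frames: (T-1,)
    sims = (normed[:-1] * normed[1:]).sum(axis=1)
    # A boundary candidate is a frame whose similarity to its
    # predecessor drops below the threshold.
    return [t + 1 for t, s in enumerate(sims) if s < threshold]

# Toy check: two constant segments with an abrupt change at t = 3.
feats = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
print(detect_event_boundaries(feats))  # [3]
```

In practice the embeddings would come from the pre-trained encoder and the fixed threshold would be replaced by a learned, differentiable scoring module.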
“…In Table 1, we present the results of our SalViT360 model alongside the existing models. We compare SalViT360 against six state-of-the-art models for 360° image and video saliency prediction, namely CP-360 [19], SalGAN360 [17], MV-SalGAN360 [20], Djilali et al. [25], ATSal [21], and PAVER [24]. ATSal, PAVER, and CP-360 are video saliency models; the rest are image-based models developed for the omnidirectional domain.…”
Section: Comparison With the State-of-the-Art
“…Yun et al. [24] use local undistorted patches with deformable CNNs and a ViT variant for self-attention across space and time. Djilali et al. [25] use self-supervised pre-training based on learning the association between several different views of the same scene, and train a supervised decoder for 360° saliency prediction as a downstream task. Although their approach considers the global relationship between viewports, it ignores the temporal dimension that is crucial for video understanding.…”
Virtual and augmented reality (VR/AR) systems have gained dramatically in popularity across application areas such as gaming, social media, and communication. It is therefore crucial to efficiently utilize, store, and deliver 360° videos to end-users. Towards this aim, researchers have been developing deep neural network models for 360° multimedia processing and computer vision. An important research direction in this line of work is to build models that can learn and predict observers' attention on 360° videos, producing so-called saliency maps computationally. Although a few saliency models have been proposed for this purpose, they generally consider only visual cues in video frames and neglect audio cues from sound sources. In this study, an unsupervised frequency-based saliency model is presented for predicting the strength and location of saliency in spatial audio. The predicted salient audio cues are then used as an audio bias on the video saliency predictions of state-of-the-art models. Our experiments yield promising results and show that integrating the proposed spatial audio bias into existing video saliency models consistently improves their performance.
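One simple way such an audio bias could be applied is as a convex combination of the (normalised) video and audio saliency maps. The fusion rule and the `alpha` weight below are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def apply_audio_bias(video_saliency: np.ndarray,
                     audio_saliency: np.ndarray,
                     alpha: float = 0.3) -> np.ndarray:
    """Bias a video saliency map with an audio-derived saliency map.

    Both inputs are non-negative (H, W) maps. Each is normalised to a
    probability distribution, then mixed with weight `alpha` on the audio
    term. This convex combination is a hypothetical fusion rule.
    """
    v = video_saliency / video_saliency.sum()
    a = audio_saliency / audio_saliency.sum()
    fused = (1.0 - alpha) * v + alpha * a
    return fused / fused.sum()  # keep the result a valid distribution

# Toy check: a uniform video map biased toward one audio source location.
video = np.ones((4, 4))
audio = np.zeros((4, 4))
audio[0, 0] = 1.0
fused = apply_audio_bias(video, audio, alpha=0.5)
```

After fusion, the map still sums to one, but mass has shifted toward the location of the audio source, which is the qualitative behaviour the study reports.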
“…2 Related Works 2.1 Self-Supervised Representation Learning SSL has recently matched the performance of supervised learning on several computer vision benchmarks [Chen et al., 2020, Djilali et al., 2021, Bachman et al., 2019, Grill et al., 2020]. Contrastive Learning.…”
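The contrastive objective referenced here can be sketched as an InfoNCE / NT-Xent loss over a batch of positive pairs. This is a minimal NumPy version of the general idea behind methods such as SimCLR [Chen et al., 2020]; symmetrisation, large batches, and projection heads are omitted:

```python
import numpy as np

def info_nce_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss where (z1[i], z2[i]) are embeddings of two views of sample i.

    Each row of z1 must match the same row of z2 (the positive); all other
    rows in the batch serve as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (N, N) pairwise similarities
    # Row-wise log-softmax; the positive for row i sits on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Matched views should incur a much lower loss than mismatched ones.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(z, z)
loss_shuffled = info_nce_loss(z, np.roll(z, 1, axis=0))
```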
Whilst computer vision models built using self-supervised approaches are now commonplace, some important questions remain. Do self-supervised models learn highly redundant channel features? What if a self-supervised network could dynamically select the important channels and discard the unnecessary ones? Convnets pre-trained with self-supervision currently obtain performance on downstream tasks comparable to their supervised counterparts in computer vision. However, self-supervised models have drawbacks, including large numbers of parameters, computationally expensive training strategies, and a clear need for faster inference on downstream tasks. In this work, our goal is to address the latter by studying how a standard channel selection method developed for supervised learning can be applied to networks trained with self-supervision. We validate our findings across a range of target budgets t_d for channel computation on image classification across different datasets, specifically CIFAR-10, CIFAR-100, and ImageNet-100, obtaining performance comparable to that of the original network when selecting all channels, at a significant reduction in computation reported in terms of FLOPs.
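A common supervised channel-selection heuristic, which this abstract suggests transferring to self-supervised networks, ranks channels by the magnitude of their batch-norm scale and keeps only the fraction t_d with the largest scales. Whether |gamma| is the exact importance criterion used in the paper is an assumption; the sketch below only illustrates the budgeted-selection mechanism:

```python
import numpy as np

def select_channels(bn_gammas: np.ndarray, t_d: float) -> np.ndarray:
    """Keep the fraction t_d of channels with the largest BN scale |gamma|.

    bn_gammas: per-channel batch-norm scale parameters of one layer.
    t_d: target budget in (0, 1]; the kept-channel count is rounded,
    with at least one channel retained.
    """
    k = max(1, int(round(t_d * len(bn_gammas))))
    keep = np.argsort(-np.abs(bn_gammas))[:k]
    return np.sort(keep)  # indices of retained channels, in layer order

# Toy layer with five channels: keep the top 60%.
gammas = np.array([0.9, 0.05, 0.4, 0.01, 0.7])
print(select_channels(gammas, t_d=0.6))  # [0 2 4]
```

Pruning a convolution from C to k input/output channels shrinks its FLOPs roughly by the factor (k/C)^2, which is where the reported compute savings come from.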