“…Self-supervision has become the new norm for learning representations given its ability to exploit unlabelled data [59,23,15,2,5,81,4,9,60,39,14]. Recent approaches devised for video understanding can be divided into two categories based on the SSL objective, namely pretext task based and contrastive learning based.…”
Section: SSL for Video Representation Learning
The task of Generic Event Boundary Detection (GEBD) aims to detect moments in videos that humans naturally perceive as generic, taxonomy-free event boundaries. Modeling the dynamically evolving temporal and spatial changes in a video makes GEBD a difficult problem to solve. Existing approaches rely on complex and sophisticated pipelines in terms of architectural design choices, creating a need for more straightforward and simplified approaches. In this work, we address this issue by revisiting a simple and effective self-supervised method and augmenting it with a differentiable motion feature learning module to tackle the spatial and temporal diversity of the GEBD task. We perform extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets to demonstrate the efficacy of the proposed approach compared to other self-supervised state-of-the-art methods. We also show that this simple self-supervised approach learns motion features without any explicit motion-specific pretext task. Our results can be reproduced on GitHub.
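As a concrete illustration of boundary detection on learned features, the sketch below flags frames whose embeddings change sharply relative to their neighbours. The cosine-similarity rule, the threshold, and the function name are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def detect_event_boundaries(frame_features: np.ndarray, threshold: float = 0.5):
    """Flag frames whose features differ sharply from the previous frame.

    frame_features: (T, D) array of per-frame embeddings, e.g. from a
    self-supervised encoder. Thresholded cosine similarity is a simplified
    stand-in for the learned boundary scoring used in real GEBD systems.
    """
    # L2-normalise each frame embedding.
    normed = frame_features / np.linalg.norm(frame_features, axis=1, keepdims=True)
    # Cosine similarity between consecutive frames: (T-1,)
    sims = (normed[:-1] * normed[1:]).sum(axis=1)
    # A boundary candidate is a frame whose similarity to its
    # predecessor drops below the threshold.
    return [t + 1 for t, s in enumerate(sims) if s < threshold]

# Toy check: two constant segments with an abrupt change at t = 3.
feats = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
print(detect_event_boundaries(feats))  # [3]
```

In practice the embeddings would come from the pre-trained encoder and the fixed threshold would be replaced by a learned, differentiable scoring module.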
“…In Table 1, we present the results of our SalViT360 model alongside the existing models. We compare SalViT360 against six state-of-the-art models for 360° image and video saliency prediction, namely CP-360 [19], SalGAN360 [17], MV-SalGAN360 [20], Djilali et al. [25], ATSal [21], and PAVER [24]. ATSal, PAVER, and CP-360 are video saliency models; the rest are image-based models developed for the omnidirectional domain.…”
Section: Comparison With the State-of-the-Art
“…Yun et al. [24] use local undistorted patches with deformable CNNs and a ViT variant for self-attention across space and time. Djilali et al. [25] use self-supervised pre-training based on learning the association between several different views of the same scene, and train a supervised decoder for 360° saliency prediction as a downstream task. Although their approach considers the global relationship between viewports, it ignores the temporal dimension that is crucial for video understanding.…”
Virtual and augmented reality (VR/AR) systems have gained dramatically in popularity across application areas such as gaming, social media, and communication. It is therefore crucial to efficiently utilize, store, and deliver 360° videos to end-users. Towards this aim, researchers have been developing deep neural network models for 360° multimedia processing and computer vision. An important research direction in this line of work is to build models that can learn and predict observers' attention on 360° videos, producing so-called saliency maps computationally. Although a few saliency models have been proposed for this purpose, they generally consider only visual cues in video frames and neglect audio cues from sound sources. In this study, an unsupervised frequency-based saliency model is presented for predicting the strength and location of saliency in spatial audio. The predicted salient audio cues are then used as an audio bias on the video saliency predictions of state-of-the-art models. Our experiments yield promising results and show that integrating the proposed spatial audio bias into existing video saliency models consistently improves their performance.
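One simple way such an audio bias could be applied is as a convex combination of the (normalised) video and audio saliency maps. The fusion rule and the `alpha` weight below are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def apply_audio_bias(video_saliency: np.ndarray,
                     audio_saliency: np.ndarray,
                     alpha: float = 0.3) -> np.ndarray:
    """Bias a video saliency map with an audio-derived saliency map.

    Both inputs are non-negative (H, W) maps. Each is normalised to a
    probability distribution, then mixed with weight `alpha` on the audio
    term. This convex combination is a hypothetical fusion rule.
    """
    v = video_saliency / video_saliency.sum()
    a = audio_saliency / audio_saliency.sum()
    fused = (1.0 - alpha) * v + alpha * a
    return fused / fused.sum()  # keep the result a valid distribution

# Toy check: a uniform video map biased toward one audio source location.
video = np.ones((4, 4))
audio = np.zeros((4, 4))
audio[0, 0] = 1.0
fused = apply_audio_bias(video, audio, alpha=0.5)
```

After fusion, the map still sums to one, but mass has shifted toward the location of the audio source, which is the qualitative behaviour the study reports.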
“…2 Related Works 2.1 Self-Supervised Representation Learning SSL has recently matched the performance of supervised learning on several computer vision benchmarks [Chen et al., 2020, Djilali et al., 2021, Bachman et al., 2019, Grill et al., 2020]. Contrastive Learning.…”
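The contrastive objective referenced here can be sketched as an InfoNCE / NT-Xent loss over a batch of positive pairs. This is a minimal NumPy version of the general idea behind methods such as SimCLR [Chen et al., 2020]; symmetrisation, large batches, and projection heads are omitted:

```python
import numpy as np

def info_nce_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss where (z1[i], z2[i]) are embeddings of two views of sample i.

    Each row of z1 must match the same row of z2 (the positive); all other
    rows in the batch serve as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (N, N) pairwise similarities
    # Row-wise log-softmax; the positive for row i sits on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Matched views should incur a much lower loss than mismatched ones.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(z, z)
loss_shuffled = info_nce_loss(z, np.roll(z, 1, axis=0))
```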
Whilst computer vision models built using self-supervised approaches are now commonplace, some important questions remain. Do self-supervised models learn highly redundant channel features? What if a self-supervised network could dynamically select the important channels and discard the unnecessary ones? Convnets pre-trained with self-supervision currently obtain performance on downstream tasks comparable to their supervised counterparts in computer vision. However, self-supervised models have drawbacks, including large numbers of parameters, computationally expensive training strategies, and a clear need for faster inference on downstream tasks. In this work, our goal is to address the latter by studying how a standard channel selection method developed for supervised learning can be applied to networks trained with self-supervision. We validate our findings across a range of target budgets t_d for channel computation on image classification across different datasets, specifically CIFAR-10, CIFAR-100, and ImageNet-100, obtaining performance comparable to that of the original network when selecting all channels, at a significant reduction in computation reported in terms of FLOPs.
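A common supervised channel-selection heuristic, which this abstract suggests transferring to self-supervised networks, ranks channels by the magnitude of their batch-norm scale and keeps only the fraction t_d with the largest scales. Whether |gamma| is the exact importance criterion used in the paper is an assumption; the sketch below only illustrates the budgeted-selection mechanism:

```python
import numpy as np

def select_channels(bn_gammas: np.ndarray, t_d: float) -> np.ndarray:
    """Keep the fraction t_d of channels with the largest BN scale |gamma|.

    bn_gammas: per-channel batch-norm scale parameters of one layer.
    t_d: target budget in (0, 1]; the kept-channel count is rounded,
    with at least one channel retained.
    """
    k = max(1, int(round(t_d * len(bn_gammas))))
    keep = np.argsort(-np.abs(bn_gammas))[:k]
    return np.sort(keep)  # indices of retained channels, in layer order

# Toy layer with five channels: keep the top 60%.
gammas = np.array([0.9, 0.05, 0.4, 0.01, 0.7])
print(select_channels(gammas, t_d=0.6))  # [0 2 4]
```

Pruning a convolution from C to k input/output channels shrinks its FLOPs roughly by the factor (k/C)^2, which is where the reported compute savings come from.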