2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2018.00514

Revisiting Video Saliency: A Large-Scale Benchmark and a New Model

Abstract: In this work, we contribute to video saliency research in two ways. First, we introduce a new benchmark for predicting human eye movements during dynamic scene free-viewing, which has long been urged in this field. Our dataset, named DHF1K (Dynamic Human Fixation), consists of 1K high-quality, elaborately selected video sequences spanning a large range of scenes, motions, object types and background complexity. Existing video saliency datasets lack variety and generality of common dynamic scenes and fall short in…

Cited by 222 publications (231 citation statements)
References 64 publications
“…Performance comparison on DHF1K:

Metric        NSS    CC     SIM    AUC-J  s-AUC
GBVS          1.775  0.331  0.201  0.855  0.592
SALICON [20]  1.901  0.327  0.232  0.857  0.590
OM-CNN [19]   1.911  0.344  0.256  0.856  0.583
DVA [38]      2.013  0.358  0.262  0.860  0.595
SalGAN [29]   2.043  0.370  0.262  0.866  0.709
ACLNet [39]   2.354  0.434  0.315  0.890  0.601
TASED-Net     2.667  0.470  0.361  0.895  0.712

Qualitative results of our model and ACLNet for the better and worse cases are given in Figure 5 (see Supplementary material for more examples of qualitative results). As shown in (a) and (b) in Figure 5, TASED-Net seems highly sensitive to salient moving objects and less sensitive to background objects, which is consistent with the goal of video saliency in general.…”
Section: Methods
confidence: 99%
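For context on the table above, a minimal NumPy sketch of the standard definitions of three of its distribution-based metrics (NSS, CC, SIM) follows. The function names and toy inputs are illustrative placeholders, not code from the cited papers; the AUC variants (AUC-J, s-AUC) are rank-based and omitted here for brevity.

import numpy as np

def nss(sal_map, fixations):
    # Normalized Scanpath Saliency: mean of the z-scored saliency map
    # at human fixation locations (fixations is a binary map).
    s = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    return s[fixations.astype(bool)].mean()

def cc(sal_map, gt_density):
    # Linear Correlation Coefficient between the predicted map and the
    # ground-truth fixation density map.
    a = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    b = (gt_density - gt_density.mean()) / (gt_density.std() + 1e-8)
    return (a * b).mean()

def sim(sal_map, gt_density):
    # Similarity: histogram intersection of the two maps, each
    # normalized to sum to 1.
    p = sal_map / (sal_map.sum() + 1e-8)
    q = gt_density / (gt_density.sum() + 1e-8)
    return np.minimum(p, q).sum()

# Toy example on random data, just to show the call pattern.
rng = np.random.default_rng(0)
pred = rng.random((64, 64))
density = rng.random((64, 64))
fix = rng.random((64, 64)) > 0.99  # sparse binary fixation map
print(nss(pred, fix), cc(pred, density), sim(pred, density))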
“…Since the ground-truth annotations for the test set of DHF1K [39] are hidden for fair comparison, we first evaluate variants of our model on the validation set. The performance of TASED-Net with different T and different temporal aggregation strategies is compared in Table 1.…”
Section: Evaluation on DHF1K
confidence: 99%
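The passage above describes evaluating a spatiotemporal model that consumes a window of T consecutive frames and predicts a saliency map for the current frame. Below is a minimal sliding-window sketch of that protocol, assuming a generic `model` callable and toy data; it is not the cited paper's actual inference code, and the padding strategy for early frames is an assumption.

import numpy as np

def predict_video_saliency(model, frames, T):
    # frames: (N, H, W, 3) array; returns (N, H, W) saliency maps.
    # Each frame t is predicted from the clip of its last T frames;
    # early frames are handled by padding with repeats of frame 0.
    maps = []
    for t in range(len(frames)):
        start = t - T + 1
        if start < 0:
            clip = np.concatenate([np.repeat(frames[:1], -start, axis=0),
                                   frames[:t + 1]], axis=0)
        else:
            clip = frames[start:t + 1]
        maps.append(model(clip))  # (H, W) map for frame t
    return np.stack(maps)

# Toy "model": average clip luminance per pixel, just to run end to end.
dummy_model = lambda clip: clip.mean(axis=(0, 3))
video = np.random.rand(10, 32, 32, 3)
print(predict_video_saliency(dummy_model, video, T=4).shape)  # (10, 32, 32)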