2018
DOI: 10.1007/978-3-030-01264-9_37

DeepVS: A Deep Learning Based Video Saliency Prediction Approach

Cited by 159 publications (147 citation statements)
References 30 publications

“…-Static Unsupervised (SU): Itti [17], LeMeur [27], GBVS [12], SUN [42], Judd [20], Hou [14], RARE2012 [34], BMS [41], -Static Deep learning (SD): Salicon [16], DeepNet [33], ML-Net [6], SalGAN [32], -Dynamic Unsupervised (DU): Fang [8], OBDL [13], -Dynamic Machine learning (DM): PQFT [10], Rudoy [35], -Dynamic Deep learning (DD): DeepVS [19], ACL-Net [40], STSconvNet [1], FGRNE [28].…”
Section: Taxonomy
Citation type: mentioning
confidence: 99%
“…Let us stress that only very few works address the temporal dimension in traditional and UAV videos. Methods that tackle the temporal dimension comprise hand-crafted motion features [10,35], network architecture fed with optical flow [1], possibly in a two-layer fashion [1,7], or Long Short-Term Memory (LSTM) architectures [2,19,40,28] to benefit from their memory functionality.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
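
The excerpt above singles out two recurring ingredients of dynamic saliency models: feeding the network with optical flow (possibly as a second stream) and using LSTM layers for temporal memory. The sketch below combines both in a minimal PyTorch model; the layer sizes, the concatenation-based fusion, and the coarse 32x32 output map are illustrative assumptions, not the architecture of DeepVS or of any specific cited method.

```python
# Minimal sketch: two-stream (RGB + optical flow) encoder with an LSTM for
# temporal memory. All dimensions are assumed for illustration only.
import torch
import torch.nn as nn

class TwoStreamSaliencyLSTM(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=256):
        super().__init__()
        # Appearance stream: encodes a single RGB frame (3 channels).
        self.rgb_stream = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Motion stream: encodes the 2-channel optical-flow field (dx, dy).
        self.flow_stream = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # LSTM over per-frame fused features provides the memory functionality.
        self.lstm = nn.LSTM(input_size=2 * feat_dim, hidden_size=hidden_dim,
                            batch_first=True)
        # Decode the temporal state into a coarse per-frame saliency map.
        self.decoder = nn.Linear(hidden_dim, 32 * 32)

    def forward(self, rgb, flow):
        # rgb: (batch, time, 3, H, W); flow: (batch, time, 2, H, W)
        b, t = rgb.shape[:2]
        rgb_feat = self.rgb_stream(rgb.flatten(0, 1)).flatten(1)     # (b*t, feat_dim)
        flow_feat = self.flow_stream(flow.flatten(0, 1)).flatten(1)  # (b*t, feat_dim)
        fused = torch.cat([rgb_feat, flow_feat], dim=1).view(b, t, -1)
        temporal, _ = self.lstm(fused)                               # (b, t, hidden_dim)
        maps = self.decoder(temporal).view(b, t, 1, 32, 32)
        return torch.sigmoid(maps)                                   # per-frame saliency maps

if __name__ == "__main__":
    model = TwoStreamSaliencyLSTM()
    rgb = torch.randn(1, 4, 3, 128, 128)
    flow = torch.randn(1, 4, 2, 128, 128)
    print(model(rgb, flow).shape)  # torch.Size([1, 4, 1, 32, 32])
```

The point of the sketch is the division of labor: the two convolutional streams handle appearance and motion per frame, while the LSTM carries information across frames, which is the "memory functionality" the excerpt refers to.
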
“…The work in [7] uses a 3D CNN to extract features, plus an LSTM network to expand the temporal span of the analysis. Other researchers use further additional modules, such as the attention mechanism [75] or object-to-motion sub-network [29].…”
Section: Saliency Prediction
Citation type: mentioning
confidence: 99%
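
As a rough illustration of the pipeline attributed to [7] in the excerpt above, the sketch below runs a small 3D CNN over short clips and an LSTM over the resulting clip descriptors to extend the temporal span of the analysis. The concrete dimensions, the pooling choice, and the scalar output head are assumptions for illustration, not details taken from the cited paper.

```python
# Minimal sketch: 3D CNN feature extractor per clip, LSTM across clips.
import torch
import torch.nn as nn

class Clip3DConvLSTM(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=256):
        super().__init__()
        # 3D convolutions see a short clip (several consecutive frames) at once.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(32, feat_dim, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # collapse (T, H, W) into one clip descriptor
        )
        # The LSTM across consecutive clip descriptors covers a longer time horizon.
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # assumed per-clip output (e.g. a score)

    def forward(self, clips):
        # clips: (batch, num_clips, 3, frames_per_clip, H, W)
        b, n = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).flatten(1)  # (b*n, feat_dim)
        temporal, _ = self.lstm(feats.view(b, n, -1))           # (b, n, hidden_dim)
        return self.head(temporal)                              # (b, n, 1)

if __name__ == "__main__":
    model = Clip3DConvLSTM()
    clips = torch.randn(2, 5, 3, 8, 64, 64)  # 2 videos, 5 clips of 8 frames each
    print(model(clips).shape)  # torch.Size([2, 5, 1])
```
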
“…For the most part, this issue is addressed inconsistently. The majority of the data sets either make no explicit mention of separating smooth pursuit from fixations (ASCMN [51], SFU [24], two Hollywood2-based sets [45,71], DHF1K [75]) or rely on the event detection built into the eye tracker, which in turn does not differentiate SP from fixations (TUD [4], USC CRCNS [13], CITIUS [39], LEDOV [29]). IRCCyN/IVC (Video 1) [9] does not mention any eye movement types at all, while IRCCyN/IVC (Video 2) [18] only names SP in passing.…”
Section: Video Saliency Data Sets
Citation type: mentioning
confidence: 99%
“…Long Short-term Memory (LSTM) networks have also been used for tracking visual saliency both in static images [12] and video stimuli [63]. In order to improve saliency estimation in videos, many approaches employ multi-stream networks, such as RGB/Optical Flow (OF) [3], RGB/OF/Depth [40], or multiple subnets such as objectness/motion [31] or saliency/gaze [22] pathways. Action Recognition: The work of [32] explored several approaches for fusing information over temporal dimension, while in [30] 3D spatio-temporal convolutions have been proposed, whose performance can be boosted when trained on large datasets [55,57] or employing ResNet architectures [25].…”
Section: Related Work
Citation type: mentioning
confidence: 99%
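
The multi-pathway designs mentioned in the excerpt above (objectness/motion or saliency/gaze subnets, RGB/optical-flow/depth streams) share a common pattern: each pathway predicts its own map and a small fusion head combines them. The sketch below shows that pattern with an assumed two-pathway setup and 1x1-convolution late fusion; the pathway names, layer sizes, and fusion choice are illustrative, not those of any specific cited model.

```python
# Minimal sketch: two subnets (assumed "objectness" and "motion" pathways)
# each predict a map; a 1x1 convolution performs late fusion.
import torch
import torch.nn as nn

def pathway(in_channels):
    # A tiny fully convolutional pathway producing a single-channel map.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 1),
    )

class MultiStreamFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.objectness = pathway(in_channels=3)  # RGB frame
        self.motion = pathway(in_channels=2)      # optical-flow field
        # Late fusion: learn per-pixel weights over the two pathway outputs.
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, rgb, flow):
        obj_map = self.objectness(rgb)   # (batch, 1, H, W)
        mot_map = self.motion(flow)      # (batch, 1, H, W)
        fused = self.fuse(torch.cat([obj_map, mot_map], dim=1))
        return torch.sigmoid(fused)      # combined saliency map

if __name__ == "__main__":
    model = MultiStreamFusion()
    rgb = torch.randn(1, 3, 64, 64)
    flow = torch.randn(1, 2, 64, 64)
    print(model(rgb, flow).shape)  # torch.Size([1, 1, 64, 64])
```
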