2021
DOI: 10.1109/access.2021.3063372
A Novel Spatio-Temporal 3D Convolutional Encoder-Decoder Network for Dynamic Saliency Prediction

Abstract: As human beings live in an ever-changing environment, predicting saliency maps from dynamic visual stimuli is important for modeling the human visual system. Compared with human behavior, recent models based on LSTM and 3DCNN still fall short due to limitations in spatio-temporal feature representation. In this paper, a novel 3D convolutional encoder-decoder architecture is proposed for saliency prediction on dynamic scenes. The encoder consists of two subnetworks to extract both spatial a…
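To make the encoder-decoder data flow concrete, the shape arithmetic of a 3D convolutional encoder-decoder can be sketched as below. This is a minimal illustration, not the paper's architecture: the clip size, kernel sizes, strides, and number of stages are all assumptions chosen so that the decoder exactly restores the input resolution.

```python
# Hypothetical shape arithmetic for a 3D conv encoder-decoder.
# All layer hyperparameters here are illustrative assumptions,
# not taken from the paper.

def conv3d_out(size, kernel, stride, padding):
    """Output length along one dimension for a 3D convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv3d_out(size, kernel, stride, padding):
    """Output length along one dimension for a transposed 3D convolution."""
    return (size - 1) * stride - 2 * padding + kernel

# Input clip: (temporal frames, height, width).
shape = (16, 112, 112)

# Encoder: three strided 3x3x3 convolution stages, each halving every dimension.
for _ in range(3):
    shape = tuple(conv3d_out(s, kernel=3, stride=2, padding=1) for s in shape)
# shape is now (2, 14, 14): a compact spatio-temporal representation.

# Decoder: three transposed-convolution stages restore the input resolution,
# so the predicted saliency map aligns with the input frames.
for _ in range(3):
    shape = tuple(deconv3d_out(s, kernel=4, stride=2, padding=1) for s in shape)
print(shape)  # -> (16, 112, 112)
```

The symmetric stride-2 down/up path is what lets a per-frame saliency map be produced at the original spatial resolution.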

Cited by 5 publications (4 citation statements)
References 44 publications (103 reference statements)
“…It was one of the quickest approaches up to this point. They have given us knowledge on hyper saliency [43]. Convolutional neural networks are trained using manual algorithmic annotations of smooth pursuits, and the findings are developed with the aid of 26 dynamic saliency models that are freely available online.…”
Section: Literature Survey
confidence: 99%
“…RecSal [30] predicts multiple saliency maps in a multiobjective training framework. Recent works introduce more [4,18,22,37,42]; U-Net-like architecture, with features sharing between encoder and decoder [6,16,19,25]; Deep Layer Aggregation [39]; Hierarchical intermediate map aggregation [1,30,35].…”
Section: Related Work
confidence: 99%
“…Figure 3: A taxonomy of decoding strategies commonly employed in video saliency prediction. Subfigures (top-left, top-right, bottom-left, bottom-right): Independent encoder and decoder, with no feature sharing between the two paths [4,18,22,37,42]; U-Net-like architecture, with feature sharing between encoder and decoder [6,16,19,25]; Deep Layer Aggregation [39]; Hierarchical intermediate map aggregation [1,30,35].…”
confidence: 99%
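The first two decoding strategies in the quoted taxonomy differ in whether the decoder sees only the deepest encoder feature or also concatenates earlier encoder features via skip connections. A minimal channel-count sketch, with all channel widths as illustrative assumptions:

```python
# Channel-count sketch contrasting an independent decoder with a
# U-Net-like decoder. Widths are illustrative assumptions only.

def encoder(in_ch):
    """Channel count after each of three encoder stages (doubling)."""
    stages, ch = [], in_ch
    for _ in range(3):
        ch *= 2
        stages.append(ch)
    return stages  # e.g. [128, 256, 512] for in_ch=64

def independent_decoder(stages):
    """Decoder sees only the deepest encoder feature; halves channels."""
    ch, widths = stages[-1], []
    for _ in range(len(stages)):
        ch //= 2
        widths.append(ch)
    return widths

def unet_decoder(stages):
    """Each stage concatenates the matching encoder feature (skip), then halves."""
    ch, widths = stages[-1], []
    for skip in reversed(stages[:-1]):
        ch = (ch + skip) // 2  # skip channels widen the input to the stage
        widths.append(ch)
    return widths

enc = encoder(64)                # [128, 256, 512]
print(independent_decoder(enc))  # [256, 128, 64]
print(unet_decoder(enc))         # [384, 256]
```

The wider decoder stages in the U-Net-like variant reflect the extra encoder channels carried across the skip connections.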
“…It has a database named dynamic human fixation 1K (DHF1K) that helps in pointing out fixations that are needed during dynamic scene free viewing, then there is the attentive convolutional neural network-long short-term memory network (ACLNet) which has augmentations to the original convolutional neural network and long short-term memory (CNN-LSTM) model to enable fast end-to-end saliency learning. In this paper [21], [22] they have made some corrections in the smooth pursuits (SP) logic. It involves manual annotations of the SPs with fixation along the arithmetic points and SP salient locations by training slicing convolutional neural networks.…”
Section: Related Work
confidence: 99%