A Spatial-Temporal Recurrent Neural Network for Video Saliency Prediction

Zhang, Kao; Chen, Zhenzhong; Liu, Shan

doi:10.1109/tip.2020.3036749

Cited by 20 publications

(9 citation statements)

References 68 publications

“…STRA-Net [11] adopts a dual-pathway architecture combining 2D ResNet50 with ConvLSTM, while the proposed STCED utilizes a dual-pathway 3D ResNet50 as the encoder, which implicitly justifies the capability of 3DCNN. As shown in Table 3, all of these four models [11], [13]-[15] perform far behind STCED on the DHF1K test set, which verifies the effectiveness of the proposed model. 1 https://mmcheng.net/videosal/…”

Section: E Comparison With the State-of-the-art (supporting)

confidence: 53%

“…The network uses two tightly coupled streams to extract appearance and motion features, and then a lightweight ConvGRU, an alternative to ConvLSTM, to model long-term temporal dependence. Zhang et al [15] propose a select and reweight fusion module to automatically weight spatial and temporal features from different domains to enhance the meaningful features and decrease less useful ones and integrate them. In order to consider inter-frame motion cues, they design an attention-aware ConvLSTM to predict the final salient region based on integrated features.…”

Section: B Modern Dynamic Saliency Models (mentioning)

confidence: 99%

“…Several sequences are illustrated with detailed analysis to show that STCED performs better than other models on two challenging cases, which are saliency allocation and shifting saliency focus when there are multiple instances in the scene. [13], TASED-Net [16], STRA-Net [11], SalEMA [10], ACLNet [7], DeepVS [9], STSConvNet [8], and models proposed by Sun et al [17], Chen et al [14], and Zhang et al [15]. For a fair comparison, all metrics are evaluated on the private test set of DHF1K and obtain from the benchmark website 1 .…”

Section: E Comparison With the State-of-the-art (mentioning)

confidence: 99%

“…The four models of STRA-Net [11], SalSAC [13], Chen et al [14], and Zhang et al [15] are based on architectures combining 2D CNN with ConvLSTM. STRA-Net [11] adopts a dual-pathway architecture combining 2D ResNet50 with ConvLSTM, while the proposed STCED utilizes a dual-pathway 3D ResNet50 as the encoder, which implicitly justifies the capability of 3DCNN.…”

Section: E Comparison With the State-of-the-art (mentioning)

confidence: 99%

“…features related to saliency. Deep models based on LSTM or its variants [7]-[15] are difficult to train for long-range temporal evolution due to the frame-by-frame data feeding scheme. Models based on 3D fully CNN (3DCNN) [16], [17] still may fail to allocate saliency or to predict shifts of saliency focus when there are multiple instances.…”

Section: Introduction (mentioning)

confidence: 99%

A Novel Spatio-Temporal 3D Convolutional Encoder-Decoder Network for Dynamic Saliency Prediction

Shi

2021

IEEE Access

Because humans live in a constantly changing environment, predicting saliency maps from dynamic visual stimuli is important for modeling the human visual system. Compared with human behavior, recent models based on LSTM and 3DCNN still fall short because of limitations in spatio-temporal feature representation. In this paper, a novel 3D convolutional encoder-decoder architecture is proposed for saliency prediction on dynamic scenes. The encoder consists of two subnetworks that extract spatial and temporal features in parallel, with intermediate fusion. The decoder produces the saliency map by first enlarging features in the spatial dimensions and then aggregating temporal information. Specially designed structures transfer pooling indices from encoder to decoder, which helps generate location-aware saliency maps. The proposed network can be trained and run for inference end to end. Experimental results on the DHF1K benchmark show that the proposed model achieves state-of-the-art performance on key metrics, including normalized scanpath saliency and Pearson's correlation coefficient.
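The two metrics named in the abstract can be illustrated with a minimal NumPy sketch. It uses the standard textbook definitions (normalized scanpath saliency as the mean z-scored saliency at fixated pixels, and Pearson's correlation between a predicted map and a ground-truth density map). It is not the DHF1K benchmark's evaluation code; the function names `nss` and `cc` and the zero-variance fallback are assumptions made for illustration.

```python
import numpy as np


def nss(saliency, fixations):
    """Normalized scanpath saliency: mean z-scored saliency at fixated pixels.

    `fixations` is a binary map with the same shape as `saliency`.
    """
    s = np.asarray(saliency, dtype=np.float64)
    f = np.asarray(fixations).astype(bool)
    if s.shape != f.shape:
        raise ValueError("saliency and fixation maps must have the same shape")
    if not f.any():
        raise ValueError("fixation map contains no fixated pixels")
    std = s.std()
    if std == 0:
        return 0.0  # assumption: a constant map is scored as zero
    z = (s - s.mean()) / std
    return float(z[f].mean())


def cc(saliency, density):
    """Pearson's correlation coefficient between two maps of equal shape."""
    s = np.asarray(saliency, dtype=np.float64).ravel()
    d = np.asarray(density, dtype=np.float64).ravel()
    if s.shape != d.shape:
        raise ValueError("maps must have the same number of pixels")
    s = s - s.mean()
    d = d - d.mean()
    denom = np.sqrt((s ** 2).sum() * (d ** 2).sum())
    if denom == 0:
        return 0.0  # assumption: undefined correlation is scored as zero
    return float((s * d).sum() / denom)


# Toy 2x2 example: one fixation on the most salient pixel.
pred = np.array([[0.0, 1.0], [2.0, 3.0]])
fix = np.array([[0, 0], [0, 1]])
print(nss(pred, fix), cc(pred, pred))
```

Higher values are better for both metrics; CC is bounded in [-1, 1], while NSS is unbounded and measured in standard deviations above the map's mean.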
