2020
DOI: 10.1609/aaai.v34i07.6927
SalSAC: A Video Saliency Prediction Model with Shuffled Attentions and Correlation-Based ConvLSTM

Abstract: The performance of predicting human fixations in videos has been greatly improved by the development of convolutional neural networks (CNNs). In this paper, we propose a novel end-to-end neural network, "SalSAC", for video saliency prediction, which uses CNN-LSTM-Attention as the basic architecture and utilizes information from both static and dynamic aspects. To better represent the static information of each frame, we first extract multi-level features of the same size from different layers of th…
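The abstract's multi-level feature fusion can be illustrated with a minimal sketch. This is not the authors' code: the shapes, function names, and the attention form are all assumptions — a softmax-weighted fusion over feature levels stands in for the paper's shuffled attention, which operates on permuted channels.

```python
import numpy as np

def softmax(x, axis=0):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_multilevel(features):
    """Fuse multi-level CNN features of the same spatial size.

    features: list of (C, H, W) arrays taken from different layers.
    A per-level scalar score (global average activation) is turned into
    softmax attention weights, and the levels are summed under those
    weights -- a simplified stand-in for the paper's shuffled attention.
    """
    scores = np.array([f.mean() for f in features])   # one score per level
    weights = softmax(scores)                         # attention over levels
    fused = sum(w * f for w, f in zip(weights, features))
    return fused, weights

# toy example: three feature levels of identical shape
rng = np.random.default_rng(0)
levels = [rng.standard_normal((8, 14, 14)) for _ in range(3)]
fused, w = fuse_multilevel(levels)
assert fused.shape == (8, 14, 14)
assert np.isclose(w.sum(), 1.0)
```

The key constraint from the abstract is only that the per-layer features share one spatial size so they can be combined elementwise; how the weights are computed here is illustrative.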

Cited by 41 publications (32 citation statements). References 24 publications.
“…STRA-Net [11] adopts a dual-pathway architecture combining a 2D ResNet50 with ConvLSTM, while the proposed STCED utilizes a dual-pathway 3D ResNet50 as the encoder, which implicitly justifies the capability of the 3D CNN. As shown in Table 3, all four of these models [11], [13]-[15] perform far behind STCED on the DHF1K test set, which verifies the effectiveness of the proposed model. 1 https://mmcheng.net/videosal/…”
Section: E. Comparison With the State-of-the-art (supporting)
confidence: 53%
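The ConvLSTM mentioned in this statement (a CNN pathway feeding a recurrent cell that keeps spatial structure) can be sketched as a single-step cell. This is a hedged illustration, not code from either cited paper: 1x1 convolutions (a per-pixel matmul) deliberately replace the usual 3x3 kernels, and all names and sizes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convlstm_step(x, h, c, W):
    """One ConvLSTM step with 1x1 convolutions, a simplification
    of the 3x3 kernels typically used in saliency models.

    x: (Cx, H, W) input features; h, c: (Ch, H, W) hidden/cell state.
    W: (4*Ch, Cx+Ch) weights producing the i, f, o, g gates jointly.
    """
    z = np.concatenate([x, h], axis=0)                 # (Cx+Ch, H, W)
    gates = np.einsum('oc,chw->ohw', W, z)             # 1x1 conv over channels
    i, f, o, g = np.split(gates, 4, axis=0)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell-state update
    h_new = sigmoid(o) * np.tanh(c_new)                # hidden-state update
    return h_new, c_new

# toy shapes: 8 input channels, 4 hidden channels, 7x7 spatial grid
rng = np.random.default_rng(1)
Cx, Ch, H, Wd = 8, 4, 7, 7
x = rng.standard_normal((Cx, H, Wd))
h = np.zeros((Ch, H, Wd)); c = np.zeros((Ch, H, Wd))
W = 0.1 * rng.standard_normal((4 * Ch, Cx + Ch))
h, c = convlstm_step(x, h, c, W)
assert h.shape == (Ch, H, Wd)
```

Because the gates are convolutional rather than fully connected, the hidden state stays a spatial map, which is why ConvLSTM pathways are a natural fit for saliency prediction.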
“…Several models [7], [10], [13], [14] based on LSTM have only one data stream. In ACLNet [7], spatial features of each frame are extracted by a CNN with an attention subnetwork.…”
Section: B. Modern Dynamic Saliency Models (mentioning)
confidence: 99%