Spatio-Temporal Self-Attention Network for Video Saliency Prediction

Wang, Ziqiang; Li, Zhi; Li, Gongyang; Wang, Yang; Zhang, Tianhong; Xu, Li-Hua; Wang, Jijun

doi:10.1109/tmm.2021.3139743

Cited by 17 publications (18 citation statements)

References 94 publications (50 reference statements)


“…By implementing psychophysically uncovered mechanisms of attentional and oculomotor control, ScanDy allows to generate sequences of eye movements for any visual scene. Recent years have shown a growing interest in the simulation of time-ordered fixation sequences for static scenes (Tatler, Brockmole, and Carpenter, 2017; Malem-Shinitski et al, 2020; Schwetlick, Rothkegel, et al, 2020; Schwetlick, Backhaus, and Engbert, 2022; Kucharsky et al, 2021; Kümmerer, Bethge, and Wallis, 2022), as well as the frame-wise prediction of where humans tend to look on average when observing a dynamic scene (Molin, Etienne-Cummings, and Niebur, 2015; Min and Corso, 2019; Droste, Jiao, and Noble, 2020; Wang, Liu, et al, 2021). We are currently not aware of another computational model that is able to simulate time-resolved gaze positions for the full duration of dynamic scenes, analogous to human eye tracking data.…”

Section: Discussion (mentioning)

confidence: 99%

“…With the availability of larger datasets in recent years (Marszalek, Laptev, and Schmid, 2009; Wang, Shen, et al, 2018), video saliency detection has also become a popular task in computer vision. Deep neural network (DNN) architectures, which include the temporal information in videos either through temporal recurrence (Linardos et al, 2019; Droste, Jiao, and Noble, 2020) or by using 3D convolutional networks (Min and Corso, 2019; Jain et al, 2021; Wang, Liu, et al, 2021), clearly outperform mechanistic models from computational neuroscience and psychology in predicting where humans tend to look. This boost in performance can be explained by the capabilities of these networks not just to encode information on low-level features like color or edges.…”

Section: Introduction (mentioning)

confidence: 99%


Objects guide human gaze behavior in dynamic real-world scenes

Roth, Rolfs, Hellwich, et al. 2023

Preprint

The complexity of natural scenes makes it challenging to experimentally study the mechanisms behind human gaze behavior when viewing dynamic environments. Historically, eye movements were believed to be driven primarily by bottom-up saliency, but increasing evidence suggests that objects also play a significant role in guiding attention. We present a new computational framework to investigate the importance of objects for attentional guidance. This framework is designed to simulate realistic scanpaths for dynamic real-world scenes, including saccade timing and smooth pursuit behavior. Individual model components are based on psychophysically uncovered mechanisms of visual attention and saccadic decision-making. All mechanisms are implemented in a modular fashion with a small number of well-interpretable parameters. To systematically analyze the importance of objects in guiding gaze behavior, we implemented four different models within this framework: two purely location-based models, where one is based on low-level saliency and one on high-level saliency, and two object-based models, with one incorporating low-level saliency for each object and the other one not using any saliency information. We optimized each model's parameters to reproduce the saccade amplitude and fixation duration distributions of human scanpaths using evolutionary algorithms. We compared model performance with respect to spatial and temporal fixation behavior, including the proportion of fixations exploring the background, as well as detecting, inspecting, and revisiting objects. A model with object-based attention and inhibition, which uses saliency information to prioritize between objects for saccadic selection, leads to scanpath statistics with the highest similarity to the human data. This demonstrates that scanpath models benefit from object-based attention and selection, suggesting that object-level attentional units play an important role in guiding attentional processing.


“…With the availability of larger datasets in recent years [ 34 , 35 ], video saliency detection has also become a popular task in computer vision. Deep neural network (DNN) architectures, which include the temporal information in videos either through temporal recurrence [ 36 , 37 ] or by using 3D convolutional networks [ 38 – 40 ], clearly outperform mechanistic models from computational neuroscience and psychology in predicting where humans tend to look. This boost in performance can be explained by the capabilities of these networks not just to encode information on low-level features like color or edges.…”

Section: Introduction (mentioning)

confidence: 99%

Objects guide human gaze behavior in dynamic real-world scenes

Roth, Rolfs, Hellwich, et al. 2023

PLoS Comput Biol

The complexity of natural scenes makes it challenging to experimentally study the mechanisms behind human gaze behavior when viewing dynamic environments. Historically, eye movements were believed to be driven primarily by space-based attention towards locations with salient features. Increasing evidence suggests, however, that visual attention does not select locations with high saliency but operates on attentional units given by the objects in the scene. We present a new computational framework to investigate the importance of objects for attentional guidance. This framework is designed to simulate realistic scanpaths for dynamic real-world scenes, including saccade timing and smooth pursuit behavior. Individual model components are based on psychophysically uncovered mechanisms of visual attention and saccadic decision-making. All mechanisms are implemented in a modular fashion with a small number of well-interpretable parameters. To systematically analyze the importance of objects in guiding gaze behavior, we implemented five different models within this framework: two purely spatial models, where one is based on low-level saliency and one on high-level saliency, two object-based models, with one incorporating low-level saliency for each object and the other one not using any saliency information, and a mixed model with object-based attention and selection but space-based inhibition of return. We optimized each model’s parameters to reproduce the saccade amplitude and fixation duration distributions of human scanpaths using evolutionary algorithms. We compared model performance with respect to spatial and temporal fixation behavior, including the proportion of fixations exploring the background, as well as detecting, inspecting, and returning to objects. 
A model with object-based attention and inhibition, which uses saliency information to prioritize between objects for saccadic selection, leads to scanpath statistics with the highest similarity to the human data. This demonstrates that scanpath models benefit from object-based attention and selection, suggesting that object-level attentional units play an important role in guiding attentional processing.


“…This model jointly modified the GATs and the self-attention mechanism that fully dynamically focused and integrated spatial, temporal and periodic correlations. Wang et al [30] proposed a novel spatial-temporal self-attention 3D network (STSANet) for video prediction, which integrated self-attention into 3D convolutional network to perceive contextual contents in semantic and spatiotemporal subspaces and narrows semantic and spatiotemporal gaps during saliency feature fusion. Chaabane et al [31] used an adapted self-attention convolutional neural network to highlight the temporal evolution of land cover areas through the construction of a spatiotemporal map.…”

Section: B Attention Mechanism In Time Series Data Prediction (mentioning)

confidence: 99%

Dynamic Spatial-Temporal Graph Attention Graph Convolutional Network for Short-Term Traffic Flow Forecasting

Tang, Sun. 2020

2020 IEEE International Symposium on Circuits and Systems (ISCAS)

Accurate traffic prediction in real time plays an important role in Intelligent Transportation Systems (ITS) and travel navigation guidance. There have been many attempts to predict short-term traffic status that consider the spatial and temporal dependencies of traffic information, such as the temporal graph convolutional network (T-GCN) model and the convolutional long short-term memory (Conv-LSTM) model. However, most existing methods use a simple adjacency matrix consisting of 0 and 1 to capture the spatial dependence, which cannot precisely describe the topological structure of the urban road network or how it changes dynamically over time. To tackle this problem, this paper proposes a dynamic temporal self-attention graph convolutional network (DT-SGN) model, which treats the adjacency matrix as a trainable attention score matrix and adapts network parameters to different inputs. Specifically, a self-attention graph convolutional network (SGN) is chosen to capture the spatial dependence, and a dynamic gated recurrent unit (Dynamic-GRU) is chosen to capture the temporal dependence and learn dynamic changes of the input data. Experiments demonstrate the superiority of our method over state-of-the-art model-driven and data-driven models on real-world traffic datasets.
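The idea in the abstract above of replacing a 0/1 adjacency matrix with attention scores can be illustrated with a minimal sketch. This is not the DT-SGN implementation: the function names `attention_adjacency` and `graph_conv`, the use of plain scaled dot-product scores without learned projections, and the toy inputs are all assumptions made for illustration, and the Dynamic-GRU part is not covered. The sketch also assumes every node has at least one allowed neighbour (for example a self-loop).

```python
import math


def attention_adjacency(features, adjacency):
    """Return row-normalised attention scores restricted to existing edges.

    Hypothetical helper: a trained model would first apply learned
    projections to the features; raw features are used here for brevity.
    """
    dim = len(features[0])
    scores = []
    for i, x_i in enumerate(features):
        # Scaled dot-product similarity between node i and every node j.
        logits = [
            sum(a * b for a, b in zip(x_i, x_j)) / math.sqrt(dim)
            for x_j in features
        ]
        # Keep only the neighbours permitted by the 0/1 adjacency matrix.
        allowed = [j for j, edge in enumerate(adjacency[i]) if edge]
        peak = max(logits[j] for j in allowed)  # for numerical stability
        exps = {j: math.exp(logits[j] - peak) for j in allowed}
        total = sum(exps.values())
        scores.append([exps.get(j, 0.0) / total for j in range(len(features))])
    return scores


def graph_conv(features, weights):
    """Aggregate neighbour features using the attention-weighted adjacency."""
    dim = len(features[0])
    return [
        [sum(w * x[k] for w, x in zip(row, features)) for k in range(dim)]
        for row in weights
    ]


if __name__ == "__main__":
    # Toy road network: three nodes with self-loops, nodes 0 and 2 unconnected.
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    adjacency = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
    weights = attention_adjacency(features, adjacency)
    print(graph_conv(features, weights))
```

Each row of the returned score matrix sums to 1 and is zero wherever the binary adjacency has no edge, so the weights depend on the input features while the road topology still constrains which nodes can exchange information.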
