A video sequence is more than a series of still images: it exhibits strong spatiotemporal correlation between the regions of consecutive frames. The most important characteristic of video is the perceived motion of foreground objects across frames. This motion dramatically changes the importance of objects in a scene and therefore alters the saliency map of the frames representing it, making saliency analysis of video far more complicated than that of still images. In this paper, we investigate saliency in video sequences and propose a novel spatiotemporal saliency model devoted to video surveillance applications. Compared to classical still-image saliency models, such as Itti's model, and to space-time saliency models, the proposed model correlates better with visual saliency perception of surveillance videos. Both bottom-up and top-down attention mechanisms are involved in this model, and stationary saliency and motion saliency are analyzed separately. First, a new method for background subtraction and foreground extraction is developed based on content analysis of the scene in the video surveillance domain. A stationary saliency model is then set up from multiple features computed on the foreground: each feature is analyzed with a multi-scale Gaussian pyramid, and the resulting conspicuity maps are combined with different weights. The stationary model integrates faces as a supplementary feature alongside low-level features such as color, intensity, and orientation. Second, a motion saliency map is calculated from the statistics of the motion vector field. Third, the motion and stationary saliency maps are merged within a center-surround framework defined by an approximated Gaussian function. The video saliency maps computed by our model have been compared to gaze maps obtained from subjective experiments with an SMI eye tracker on surveillance video sequences. The results show a strong correlation between the output of the proposed spatiotemporal saliency model and the experimental gaze maps.
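As a concrete illustration of the final fusion step, the sketch below merges a stationary saliency map and a motion saliency map under a center-surround weighting given by an approximated 2-D Gaussian centered on the frame. The equal blend weights and the sigma_ratio parameter are illustrative assumptions; the abstract does not specify the exact fusion rule or its parameters.

```python
import numpy as np

def fuse_saliency(stationary, motion, sigma_ratio=0.25):
    """Merge stationary and motion saliency maps under a center-surround
    Gaussian weighting. A minimal sketch: the 50/50 blend and sigma_ratio
    are assumptions, not values taken from the paper."""
    h, w = stationary.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sigma = sigma_ratio * min(h, w)
    # Approximated 2-D Gaussian that emphasizes the center of the frame.
    g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    fused = g * (0.5 * stationary + 0.5 * motion)
    # Rescale to [0, 1] so the result is a valid saliency map.
    rng = fused.max() - fused.min()
    return (fused - fused.min()) / rng if rng > 0 else fused
```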
When viewing video sequences, the human visual system (HVS) tends to focus on active objects, which are perceived as the most salient regions in the scene. In addition, human observers tend to predict the future positions of moving objects in a dynamic scene and to direct their gaze to those positions. In this paper, we propose a saliency detection model that accounts for the motion in the sequence and predicts the positions of salient objects in future frames. This is a novel technique for attention models that we call the Predictive Saliency Map (PSM). The PSM improves the consistency of the estimated saliency maps for video sequences: it uses both the static information provided by static saliency maps (SSM) and motion vectors to predict the salient regions of the next frame. Since we focus on surveillance videos, we consider, in addition to low-level features such as intensity, color, and orientation, high-level features such as faces, which naturally attract viewers' attention. Saliency maps computed from these static features are combined with motion saliency maps to account for the saliency created by activity in the scene. The predicted saliency map is computed from previous saliency maps and motion information. The PSMs are compared with experimentally obtained gaze maps and with saliency maps produced by approaches from the literature. The experimental results show that our enhanced model yields a higher ability to predict eye fixations in surveillance videos.
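To make the PSM idea concrete, the sketch below forward-projects the previous frame's saliency along a dense motion vector field and blends the result with the static saliency map of the incoming frame. The forward-warping scheme and the blend weight alpha are hypothetical choices for illustration; the abstract does not give the paper's exact prediction rule.

```python
import numpy as np

def predictive_saliency(prev_saliency, mv_x, mv_y, static_saliency, alpha=0.6):
    """Predict the next frame's saliency map from the previous one and a
    dense motion vector field, then blend with static saliency. A sketch:
    the warping scheme and alpha are illustrative assumptions."""
    h, w = prev_saliency.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Shift every pixel's saliency to where the motion field says it moves.
    tx = np.clip(xs + np.rint(mv_x).astype(int), 0, w - 1)
    ty = np.clip(ys + np.rint(mv_y).astype(int), 0, h - 1)
    predicted = np.zeros_like(prev_saliency)
    # Keep the maximum when several pixels project to the same location.
    np.maximum.at(predicted, (ty, tx), prev_saliency)
    return alpha * predicted + (1.0 - alpha) * static_saliency
```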
The perception of video differs from that of still images because of the motion information video carries. Moving objects create differences between neighboring frames, and these differences usually draw the viewer's attention. So far, most research has addressed image saliency, while video saliency has received comparatively little attention. Based on scene understanding, this paper proposes a new multi-feature video saliency detection model. First, the background is extracted using binary tree searching; then the main features of the foreground are analyzed with a multi-scale perception model. The perception model integrates faces as a high-level feature, supplementing low-level features such as color, intensity, and orientation. A motion saliency map is calculated from the statistics of the motion vector field. Finally, the multi-feature conspicuities are merged with different weights. Compared with gaze maps from subjective experiments, the output of the multi-feature video saliency detection model is close to the gaze map.
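The weighted merging of per-feature conspicuity maps described above might look like the following sketch. The feature set (color, intensity, orientation, face, motion) follows the abstract, but the weight values shown in the usage example are assumptions for illustration; the abstract does not state the weights actually used.

```python
import numpy as np

def normalize01(m):
    """Rescale a map to [0, 1] so differently scaled features are comparable."""
    m = np.asarray(m, dtype=np.float64)
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

def merge_conspicuities(maps, weights):
    """Weighted fusion of per-feature conspicuity maps into one saliency map.
    The weights passed in are placeholders, not the paper's values."""
    total = sum(weights[k] for k in maps)
    return sum((weights[k] / total) * normalize01(maps[k]) for k in maps)

# Hypothetical usage: faces and motion weighted more heavily, reflecting the
# abstract's emphasis on high-level and motion cues in surveillance video.
# saliency = merge_conspicuities(
#     {"color": C, "intensity": I, "orientation": O, "face": F, "motion": M},
#     {"color": 1.0, "intensity": 1.0, "orientation": 1.0,
#      "face": 2.0, "motion": 2.0},
# )
```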