Background subtraction is an important task in computer vision. Traditional approaches usually rely on low-level visual features such as color, texture, or edges to build background models. Lacking deep features, they often perform poorly on complex video scenes involving illumination changes, background or camera motion, camouflage effects, and shadows. Recently, deep learning has proven effective at extracting deep features. To improve the robustness of background subtraction, in this paper we propose an end-to-end multi-scale spatio-temporal (MS-ST) method that extracts deep features from video sequences. First, a video clip is fed into a convolutional neural network to extract multi-scale spatial features. Then, to exploit temporal information, we combine temporal sampling operations with ConvLSTM modules to extract multi-scale temporal contextual information. Finally, the segmentation result is generated by fusing the multi-scale spatio-temporal features. Experimental results on the CDnet-2014 and LASIESTA datasets demonstrate the effectiveness and superiority of the proposed method.

INDEX TERMS Background subtraction, ConvLSTM, convolutional neural network, computer vision.
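The temporal sampling operation mentioned above can be illustrated with a minimal sketch: sub-sampling the input clip at several temporal strides yields sub-sequences that a recurrent module (such as a ConvLSTM) can process to capture context at multiple time scales. The stride values and function name below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def temporal_samples(clip, strides=(1, 2, 4)):
    """Sub-sample a video clip at several temporal strides.

    clip: array of shape (T, H, W, C). Returns one sub-sequence per
    stride, each ending at the most recent frame, so a recurrent
    module sees short- and long-range temporal context.
    (Illustrative sketch; strides are not from the paper.)
    """
    # Reverse, take every s-th frame (keeping the newest), reverse back.
    return [clip[::-1][::s][::-1] for s in strides]

clip = np.zeros((8, 4, 4, 3))  # 8-frame toy clip
subs = temporal_samples(clip)
print([s.shape[0] for s in subs])  # frame counts: [8, 4, 2]
```

Each sub-sequence ends at the current frame, so all temporal scales are aligned on the frame being segmented.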
I. INTRODUCTION
Background subtraction is an important task in the computer vision domain, and it plays a fundamental role in many applications such as autonomous driving [1], object tracking [2], crowd analysis [3], traffic analytics [4], and automated anomaly detection [5] in video surveillance. Background subtraction is the process of distinguishing moving objects, defined as foreground, from a given scene; it can therefore be regarded as a binary-classification task.

The challenge of background subtraction is to design an algorithm that can distinguish significant changes from noise-induced ones. Noise-induced changes are caused by illumination variation, weather, background motion, shadows, intermittent object motion, occlusion, and camera motion. To detect significant changes accurately, numerous algorithms have been proposed over the last two decades. Typical algorithms such as GMM [6], KDE [7], ViBe [8], and PBAS [9] rely on a background model to separate moving objects from the background scene. Other algorithms [10]-[18] employ discriminative hand-crafted features to improve robustness.

The associate editor coordinating the review of this manuscript and approving it for publication was Simone Bianco.
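The background-model view of the task can be made concrete with a toy sketch: maintain a per-pixel background estimate, classify each pixel as foreground or background by thresholding its deviation from the model, and slowly adapt the model. This is a simple running-average model for illustration only, not GMM, KDE, ViBe, or PBAS; the function name and parameter values are assumptions.

```python
import numpy as np

def subtract_background(frames, alpha=0.05, thresh=30.0):
    """Toy background-model subtraction (running average, illustrative).

    frames: array of shape (T, H, W). For each frame, pixels whose
    absolute deviation from the background model exceeds `thresh`
    are labeled foreground (1), others background (0) -- the
    binary-classification view of the task.
    """
    frames = np.asarray(frames, dtype=np.float64)
    bg = frames[0].copy()                  # initialize model with first frame
    masks = []
    for f in frames:
        mask = (np.abs(f - bg) > thresh).astype(np.uint8)
        bg = (1 - alpha) * bg + alpha * f  # slowly adapt the background model
        masks.append(mask)
    return masks

# Static 'scene' with a bright 2x2 object appearing in the last frame.
frames = np.full((3, 4, 4), 100.0)
frames[2, 1:3, 1:3] = 200.0
masks = subtract_background(frames)
print(masks[2].sum())  # 4 foreground pixels detected
```

Such per-pixel models are exactly where the low-level-feature limitation discussed above bites: a shadow or illumination change shifts pixel values just as a true object does, which motivates the deep spatio-temporal features proposed in this paper.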