Visual tracking performance has long been limited by the shortcomings of appearance models, which fail either when the object's appearance changes rapidly, as in motion-based tracking, or when accurate appearance information is unavailable, as under color camouflage (where foreground and background colors are similar). This paper proposes a robust, adaptive appearance model that remains accurate under color camouflage, even in the presence of complex natural objects. The proposed model adds depth as a feature in a hierarchical modular neural framework for online object tracking. The model copes with confusing appearance by exploiting the stable depth separation between the target and surrounding objects; depth complements RGB features in scenarios where RGB features fail to adapt and become unstable over long durations. The parameters of the model are learned efficiently in a deep network consisting of three modules: (1) a spatial attention layer, which discards most of the background by selecting a region containing the object of interest; (2) an appearance attention layer, which extracts appearance and spatial information about the tracked object; and (3) a state estimation layer, which enables the framework to predict future object appearance and location. Three models were trained and tested to analyze the effect of depth alongside RGB information, and a further model is proposed that uses depth alone as input for tracking. The proposed models were also evaluated in real time using a Kinect V2 and showed very promising results. Comparing our proposed network structures with a state-of-the-art RGB tracking model demonstrates that adding depth significantly improves tracking accuracy in more challenging (cluttered and camouflaged) environments. Furthermore, the results of the depth-based models show that depth data alone can provide enough information for accurate tracking, even without RGB information.
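The abstract above does not include code; the following is a minimal PyTorch sketch of the three-module layout it describes (spatial attention, appearance attention, state estimation). Every class name, layer size, and the four-channel RGB-D input convention here is an illustrative assumption rather than the authors' actual architecture; a depth-only variant would simply use a single input channel.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Suppresses most of the background with a per-pixel attention map."""
    def __init__(self, in_channels):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, x):
        attn = torch.sigmoid(self.score(x))  # (B, 1, H, W) in [0, 1]
        return x * attn                      # background pixels are damped

class AppearanceAttention(nn.Module):
    """Encodes appearance/spatial information of the attended region."""
    def __init__(self, in_channels, feat_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):
        return self.proj(self.encoder(x).flatten(1))

class StateEstimator(nn.Module):
    """Recurrent state that predicts the next box (cx, cy, w, h)."""
    def __init__(self, feat_dim, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        self.box_head = nn.Linear(hidden_dim, 4)

    def forward(self, feat, hidden):
        hidden = self.rnn(feat, hidden)
        return self.box_head(hidden), hidden

class RGBDTracker(nn.Module):
    """RGB-D variant: 4 input channels (RGB + depth); depth-only would use 1."""
    def __init__(self, in_channels=4, feat_dim=256, hidden_dim=128):
        super().__init__()
        self.spatial = SpatialAttention(in_channels)
        self.appearance = AppearanceAttention(in_channels, feat_dim)
        self.state = StateEstimator(feat_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, frames):
        # frames: (T, B, C, H, W) sequence of RGB-D frames
        T, B = frames.shape[:2]
        hidden = frames.new_zeros(B, self.hidden_dim)
        boxes = []
        for t in range(T):
            x = self.spatial(frames[t])
            feat = self.appearance(x)
            box, hidden = self.state(feat, hidden)
            boxes.append(box)
        return torch.stack(boxes)  # (T, B, 4) predicted boxes per frame

# Toy usage: 8 frames, batch of 2, 4-channel RGB-D input.
tracker = RGBDTracker(in_channels=4)
clip = torch.randn(8, 2, 4, 128, 128)
print(tracker(clip).shape)  # torch.Size([8, 2, 4])
```

The recurrent state estimator is what lets such a tracker carry appearance and location forward in time, which is the property the abstract attributes to its third module.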
Inspired by the notions of swarm robotics, sensing, and minimalism, this paper studies how a collection of 1D depth scans alone can serve as a minimal feature set for detecting the human body and segmenting it in a point cloud. Compared with traditional approaches, which require a complete point-cloud representation for skeleton reconstruction, the proposed approach demands less computation and power, which is especially valuable in sensor and robotic networks. Our main objective is to investigate whether the reduced amount of training data obtained from a collection of 1D scans of a subject affects the recognition rate, and whether such scans suffice to detect the human body and its posture accurately. The method exploits the frequency components of the depth images (referred to here as 1D scans). To coordinate a collection of these 1D scans obtained through a sensor network, we also propose a sensor scheduling framework, which is evaluated using two stationary depth sensors and one mobile depth sensor. The method's performance was analyzed on the movements and posture details of a subject at two orientations relative to the sensors, across two posture classes: walking and standing. The novelty of the paper can be summarized in three main points. Firstly, unlike deep learning methods, our approach requires a smaller dataset for training. Secondly, our case studies show that, even with a very limited training dataset, the method can handle unseen situations and reasonably estimate the orientation and details of the posture. Finally, we propose an online scheduler that improves the energy efficiency of the sensor network and minimizes the number of sensors required for surveillance monitoring by employing a mobile sensor to recover views occluded from the stationary sensors. We showed that, with training data captured at 1 m from the camera, the algorithm can detect the subject's detailed posture at 1, 2, 3, and 4 m from the sensor during walking and standing with an average accuracy of 93%, and across different orientations relative to the sensor with 71% accuracy.
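As a rough illustration of how the frequency components of a 1D depth scan might be turned into a compact posture feature, here is a small NumPy sketch. The feature definition (low-order FFT magnitudes of a mean-removed scan), the nearest-template classifier, and the synthetic scans are all assumptions for demonstration and are not the paper's actual pipeline.

```python
import numpy as np

def scan_features(depth_scan, n_coeffs=16):
    """Frequency-domain feature vector for one 1D depth scan.

    Removes the DC offset (overall distance to the subject), takes the
    real-input FFT, and keeps the magnitudes of the lowest n_coeffs
    coefficients, normalized to unit length for scale invariance.
    """
    scan = np.asarray(depth_scan, dtype=float)
    scan = scan - scan.mean()
    spectrum = np.abs(np.fft.rfft(scan))
    feats = spectrum[:n_coeffs]
    norm = np.linalg.norm(feats)
    return feats / norm if norm > 0 else feats

def classify(scan, templates):
    """Nearest-template classification over labeled training scans.

    templates: list of (label, feature_vector) pairs built from a small
    training set of 1D scans, matching the paper's low-data premise.
    """
    f = scan_features(scan)
    dists = [(np.linalg.norm(f - t), label) for label, t in templates]
    return min(dists)[1]

# Toy usage with synthetic scans (real scans would come from a depth sensor).
rng = np.random.default_rng(0)
x = np.linspace(0, 6 * np.pi, 512)
walking = 1000 + 200 * np.sin(x) + rng.normal(0, 5, 512)           # high-variation profile
standing = 1000 + 50 * np.sin(x / 3) + rng.normal(0, 5, 512)       # near-flat profile
templates = [("walking", scan_features(walking)),
             ("standing", scan_features(standing))]
query = 1000 + 190 * np.sin(x) + rng.normal(0, 5, 512)
print(classify(query, templates))  # -> "walking"
```

Because the feature is built from normalized low-frequency magnitudes rather than raw depths, a classifier of this kind can plausibly tolerate changes in subject distance, which is consistent with the abstract's report of recognition at 1 to 4 m after training at 1 m.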