Privacy-preserving, high-quality people detection is a vital computer vision task for various indoor scenarios, e.g., people counting, customer behavior analysis, ambient assisted living, or smart homes. In this work, we propose a novel approach for people detection in multiple overlapping depth images. We present a probabilistic framework that uses a generative scene model to jointly exploit the multi-view image evidence, allowing us to detect people from arbitrary viewpoints. Our approach uses mean-field variational inference not only to estimate the maximum a posteriori (MAP) state but also to approximate the posterior probability distribution of people present in the scene. Evaluation shows state-of-the-art results on a novel data set for indoor people detection and tracking in top-view depth images with high perspective distortions. Furthermore, we demonstrate that our approach, compared to the mono-view setup, successfully exploits the multi-view image evidence and robustly converges in only a few iterations.

INDEX TERMS Depth sensor indoor surveillance, depth sensor networks, generative scene model, joint multi-view person detection, mean-field variational inference, multi-camera person detection, people detection in top-view, vertical top-view pedestrian detection.