In this paper, we propose a robust method for segmenting foreground objects, such as human regions, from sparsely arranged multi-view cameras. This work is intended for an immersive telepresence system, which can be realized by segmenting a conferee (i.e., the foreground) from the video captured at each conference site and then synthesizing life-sized textures onto the background of another site. Segmentation, however, is very challenging when the background has a texture similar to the foreground or when the illumination varies over time; in practice, the segmentation accuracy of conventional methods is insufficient to realize such a telepresence system. The proposed method achieves sufficient segmentation quality by directly estimating foreground regions in three-dimensional space, based on the object existence probability for each individual camera and the color similarity among multiple cameras. Experimental results demonstrate the effectiveness of the proposed method in terms of foreground segmentation accuracy. Furthermore, we confirmed that motion parallax for head movement can be experienced naturally.
I. INTRODUCTION

Our motivation for this study is to realize a telepresence system [1] based on sparsely arranged multi-view cameras, without dedicated equipment. In this system, the region of an attendee at each conference site is extracted accurately from multi-view video sequences. The segmented texture from each camera is then naturally synthesized onto the background of another site, as illustrated in Fig. 1.

The European FP7 3DPresence project aimed to build a multi-view, multi-user 3D videoconferencing system. Within the project, research activities were reported that cut out attendees from the real scene and virtually synthesized them into the background of another 3D space [2]. The major challenge in these activities was the generation of high-quality depth maps or the reconstruction of accurate 3D models of human regions. The main approach to depth map generation is disparity estimation based on stereo block matching [3], while the main approach to 3D model reconstruction is volumetric reconstruction based on shape-from-silhouette algorithms [4]. For stereo block matching, depth estimation quality depends heavily on the camera intervals, and long intervals introduce artifacts. In contrast, shape-from-silhouette algorithms can stably reconstruct high-quality 3D models even from multiple cameras placed at long intervals.
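The shape-from-silhouette idea referenced above can be summarized as intersecting the silhouette cones of all cameras: a voxel belongs to the visual hull only if it projects inside the foreground silhouette of every camera. A minimal voxel-carving sketch is shown below; the function name, camera setup, and all parameters are illustrative assumptions, not taken from the paper or from any particular library.

```python
import numpy as np

def carve_voxels(voxel_centers, projections, silhouettes):
    """Keep a voxel only if it projects inside every camera's silhouette.

    voxel_centers: (N, 3) world-space voxel centers
    projections:   list of (3, 4) camera projection matrices
    silhouettes:   list of (H, W) boolean foreground masks
    Returns a boolean occupancy array of shape (N,).
    """
    # Homogeneous coordinates for one-shot projection of all voxels.
    homog = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    occupied = np.ones(len(voxel_centers), dtype=bool)
    for P, sil in zip(projections, silhouettes):
        uvw = homog @ P.T                          # project into image plane
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        h, w = sil.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(voxel_centers), dtype=bool)
        hit[inside] = sil[v[inside], u[inside]]    # silhouette test per voxel
        occupied &= hit                            # intersect silhouette cones
    return occupied
```

The intersection in the last line is what makes the method tolerant of long camera intervals: each camera only needs a reliable binary silhouette, not dense pixel correspondences, so there is no block-matching step whose accuracy degrades with baseline length.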