This paper addresses the problem of image segmentation with a reference distribution. Recent studies have shown that segmentation with global consistency measures outperforms conventional techniques based on pixel-wise measures. However, such global approaches require a precise reference distribution to extract the correct region. To relax this strict assumption, we propose a new approach in which the given reference distribution plays a guiding role in inferring the latent distribution and its consistent region. The inference is based on the assumption that the latent distribution resembles the distribution of the consistent region but is distinct from the distribution of the complement region. We state the problem as the minimization of an energy function consisting of global similarities based on the Bhattacharyya distance, and we implement a novel iterated distribution-matching process that jointly optimizes the distribution and the segmentation. We evaluate the proposed algorithm on the GrabCut dataset and demonstrate the advantages of our approach on various segmentation problems, including interactive segmentation, background subtraction, and co-segmentation.
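As a rough, hypothetical illustration of the global similarity term (not the paper's exact formulation), the following Python sketch computes the Bhattacharyya distance between the normalized histogram of a candidate region and a reference distribution; the grayscale histogramming, bin count, and image value range are simplifying assumptions.

    import numpy as np

    def bhattacharyya_distance(p, q, eps=1e-12):
        """Bhattacharyya distance between two discrete distributions p and q."""
        bc = np.sum(np.sqrt(p * q))   # Bhattacharyya coefficient, in [0, 1]
        return -np.log(bc + eps)      # 0 when p == q, larger when dissimilar

    def region_histogram(image, mask, bins=32):
        """Normalized intensity histogram of the pixels selected by a binary mask."""
        hist, _ = np.histogram(image[mask], bins=bins, range=(0.0, 1.0))
        return hist / max(hist.sum(), 1)

    # A global energy term would then score a segmentation mask by
    # bhattacharyya_distance(region_histogram(image, mask), reference).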
The system described in this paper provides a real-time 3D visual experience by using an array of 64 video cameras and an integral photography display with 60 viewing directions. The live 3D scene in front of the camera array is reproduced by the full-color, full-parallax autostereoscopic display with interactive control of the viewing parameters. The main technical challenge is fast and flexible conversion of the data from the 64 multicamera images to the integral photography format. Based on image-based rendering techniques, our conversion method first renders 60 novel images corresponding to the viewing directions of the display and then arranges the rendered pixels to produce an integral photography image. For real-time processing on a single PC, all the conversion processes are implemented on a GPU with GPGPU techniques. The conversion method also allows a user to interactively control the viewing parameters of the displayed image so that the dynamic 3D scene is reproduced with the desired parameters. This control is performed as a software process, without reconfiguring the hardware system, by changing rendering parameters such as the convergence point and the interval between the viewpoints of the rendering cameras.
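To make the pixel-arrangement step concrete, here is a minimal numpy sketch (a toy under stated assumptions, not the system's GPU implementation) that interleaves rendered directional views into an integral photography image; it assumes the viewing directions lie on a square d x d grid, whereas the actual display has 60 directions.

    import numpy as np

    def to_integral_photography(views):
        """Interleave directional views into an integral photography image.

        views: array of shape (D, H, W, 3), one rendered image per viewing
        direction, with D = d * d directions assumed to form a d x d grid.
        Returns an image of shape (H * d, W * d, 3) in which each d x d
        elemental image collects one pixel from every view.
        """
        D, H, W, C = views.shape
        d = int(np.sqrt(D))
        out = np.zeros((H * d, W * d, C), dtype=views.dtype)
        for v in range(d * d):
            vy, vx = divmod(v, d)
            # Pixel (y, x) of view v lands at offset (vy, vx) inside the
            # elemental image located at (y, x).
            out[vy::d, vx::d] = views[v]
        return out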
We propose a method of using a focal stack, i.e., a set of differently focused images, as the input for a novel light field display called a "tensor display." Although this display consists of only a few light-attenuating layers placed in front of a backlight, it can be viewed from many directions (angles) simultaneously without sacrificing the resolution of any viewing direction. Conventionally, a transmittance pattern is calculated for each layer from a light field, namely, a dense set of multi-view images (typically dozens) that are to be observed from different directions. However, preparing such a large set of images is often cumbersome for real objects. We developed a method that does not require a complete light field as the input; instead, a focal stack composed of only a few differently focused images is directly transformed into the layer patterns. Our method greatly reduces the cost of acquiring data while maintaining the quality of the output light field. We validated the method with experiments using synthetic light field datasets and a focal stack acquired by an ordinary camera.
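As a toy forward model of how a focal stack constrains the layer patterns (a 1-D sketch under simplifying assumptions, not the paper's transform), the following Python code simulates one focal-stack slice emitted by two multiplicative layers; fitting the layer patterns would then amount to minimizing the difference between such simulated slices and the input focal stack.

    import numpy as np

    def simulate_focal_slice(front, rear, slopes, depth):
        """One refocused image from a two-layer multiplicative display (1-D toy).

        front, rear: 1-D transmittance patterns in [0, 1].
        slopes: integer ray slopes (pixel shift per unit layer separation).
        depth: focus depth relative to the rear layer, in separation units.
        """
        acc = np.zeros_like(rear, dtype=float)
        for s in slopes:
            # A ray hitting the rear layer at x crosses the front layer at
            # x + s; its intensity is the product of the two transmittances.
            ray = rear * np.roll(front, -s)
            # Refocusing shifts each angular sample by s * depth, then averages.
            acc += np.roll(ray, int(round(s * depth)))
        return acc / len(slopes)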
Thanks to the excellent learning capability of deep convolutional neural networks (CNNs), monocular depth estimation using CNNs has achieved great success in recent years. However, depth estimation from a monocular image alone is essentially an ill-posed problem, and thus this approach is likely to have inherent vulnerabilities. To reveal this limitation, we propose an adversarial patch attack on monocular depth estimation. More specifically, we generate artificial patterns (adversarial patches) that can fool the target methods into estimating an incorrect depth for the regions where the patterns are placed. Our method can be carried out in the real world by physically placing the printed patterns in real scenes. We also analyze the behavior of monocular depth estimation under attack by visualizing the activation levels of the intermediate layers and the regions potentially affected by the adversarial attack.
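The optimization behind such an attack can be sketched as a gradient loop over the patch pixels. The following PyTorch code is a hypothetical minimal version: it pastes the patch at a fixed location and pushes the predicted depth of that region toward a wrong (far) value, whereas a physical attack would additionally randomize placement, scale, and lighting; the (N, 1, H, W) output shape of depth_model is an assumption.

    import torch

    def optimize_patch(depth_model, images, patch_size=64, steps=500, lr=0.01):
        """Learn a patch that inflates the estimated depth of the patched region.

        depth_model: frozen, differentiable monocular depth network.
        images: tensor of shape (N, 3, H, W) with values in [0, 1].
        """
        patch = torch.rand(1, 3, patch_size, patch_size, requires_grad=True)
        opt = torch.optim.Adam([patch], lr=lr)
        for _ in range(steps):
            x = images.clone()
            # Paste the patch at the top-left corner (fixed placement for brevity).
            x[:, :, :patch_size, :patch_size] = patch.clamp(0, 1)
            depth = depth_model(x)
            # Maximize the depth predicted under the patch.
            loss = -depth[:, :, :patch_size, :patch_size].mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return patch.detach().clamp(0, 1)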
A light-field display provides not only binocular depth sensation but also natural motion parallax with respect to head motion, which evokes a strong feeling of immersion. Such a display can be implemented with a set of stacked layers, each of which has pixels that can carry out light-ray operations (multiplication and addition). With this structure, the appearance of the display varies over the observed directions (i.e., a light field is produced) because the light rays pass through different combinations of pixels depending on both their originating points and their outgoing directions. To display a specific 3D scene, the layer patterns should be optimized so that the produced light field is as close as possible to that of the target 3D scene. To deepen the understanding of this type of light-field display, we focused on two important factors: the light-ray operations carried out by the layers and the optimization methods for the layer patterns. Specifically, we compared multiplicative and additive layers, each optimized either with analytical methods derived from mathematical optimization or with faster data-driven methods implemented as convolutional neural networks (CNNs). We compared combinations of these two factors in terms of the accuracy of light-field reproduction and computation time. Our results indicate that multiplicative layers achieve better accuracy than additive ones, and CNN-based methods run faster than analytical ones. We suggest that the best choice, in terms of the balance between accuracy and computation speed, is multiplicative layers optimized with a CNN-based method.
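The two light-ray operations compared here can be written down directly. The following numpy sketch (a 1-D toy with integer slopes and wrap-around at the borders, ignoring backlight scaling and clipping) evaluates one ray through a stack of layers as either a product of transmittances or a sum of emissions.

    import numpy as np

    def ray_through_layers(layers, x, slope, multiplicative=True):
        """Intensity of one ray crossing stacked 1-D layers.

        layers: 1-D pixel arrays ordered from rear to front; a ray leaving
        position x on the rear layer with the given slope samples layer k
        at x + k * slope (wrapped at the borders for simplicity).
        """
        samples = [layer[(x + k * slope) % layer.shape[0]]
                   for k, layer in enumerate(layers)]
        if multiplicative:
            return np.prod(samples)   # attenuating layers: product of transmittances
        return np.sum(samples)        # additive layers: sum of emitted intensities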