Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and topdown attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
We describe an image based rendering approach that generalizes many image based rendering algorithms currently in use including light field rendering and view-dependent texture mapping. In particular it allows for lumigraph style rendering from a set of input cameras that are not restricted to a plane or to any specific manifold. In the case of regular and planar input camera positions, our algorithm reduces to a typical lumigraph approach. In the case of fewer cameras and good approximate geometry, our algorithm behaves like view-dependent texture mapping. Our algorithm achieves this flexibility because it is designed to meet a set of desirable goals that we describe. We demonstrate this flexibility with a variety of examples.
In this paper, we describe an efficient image-based approach to computing and shading visual hulls from silhouette image data. Our algorithm takes advantage of epipolar geometry and incremental computation to achieve a constant rendering cost per rendered pixel. It does not suffer from the computation complexity, limited resolution, or quantization artifacts of previous volumetric approaches. We demonstrate the use of this algorithm in a real-time virtualized reality application running off a small number of video streams.
Determining shape from stereo has often been posed as a global minimization problem. Once formulated, the minimization problems are then solved with a variety of algorithmic approaches. These approaches include techniques such as dynamic programming min-cut and alpha-expansion. In this paper we show how an algorithmic technique that constructs a discrete spatial minimal cost surface can be brought to bear on stereo global minimization problems. This problem can then be reduced to a single min-cut problem. We use this approach to solve a new global minimization problem that naturally arises when solving for three-camera (trinocular) stereo. Our formulation treats the three cameras symmetrically, while imposing a natural occlusion cost and uniqueness constraint.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.