Abstract: We introduce a simple modification of local image descriptors, such as SIFT, based on pooling gradient orientations across different domain sizes, in addition to spatial locations. The resulting descriptor, which we call DSP-SIFT, outperforms other methods in wide-baseline matching benchmarks, including those based on convolutional neural networks, despite having the same dimension as SIFT and requiring no training.
“…Other methods have directly learned a similarity measure for comparing patches using a convolutional similarity network [19,51,41,50]. Even though CNN-based descriptors encode a discriminative structure with a deep architecture, they have inherent limitations in handling large intra-class variations [41,10]. Furthermore, they are mostly tailored to estimate sparse correspondences, and cannot in practice provide dense descriptors due to their high computational complexity.…”
We present a descriptor, called fully convolutional self-similarity (FCSS), for dense semantic correspondence. To robustly match points among different instances within the same object class, we formulate FCSS using local self-similarity (LSS) within a fully convolutional network. In contrast to existing CNN-based descriptors, FCSS is inherently insensitive to intra-class appearance variations because of its LSS-based structure, while maintaining the precise localization ability of deep neural networks. The sampling patterns of local structure and the self-similarity measure are jointly learned within the proposed network in an end-to-end and multi-scale manner. As training data for semantic correspondence is rather limited, we propose to leverage object candidate priors provided in existing image datasets and also correspondence consistency between object pairs to enable weakly-supervised learning. Experiments demonstrate that FCSS outperforms conventional handcrafted descriptors and CNN-based descriptors on various benchmarks.
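The local self-similarity idea underlying FCSS can be illustrated outside a network: an LSS-style descriptor compares a small central patch against patches in a surrounding region, so it captures internal layout rather than raw appearance. Below is a minimal hand-crafted sketch of that measure; the function name, patch/region sizes, and SSD-to-similarity mapping are illustrative assumptions, not the paper's learned implementation.

```python
import numpy as np

def local_self_similarity(image, center, patch=3, region=7):
    """Local self-similarity (LSS) sketch: compare the central patch
    against every patch position in a surrounding region, turning
    sum-of-squared-differences (SSD) into a similarity score."""
    r_p, r_r = patch // 2, region // 2
    cy, cx = center
    ref = image[cy - r_p:cy + r_p + 1, cx - r_p:cx + r_p + 1].astype(float)
    sims = []
    for dy in range(-r_r, r_r + 1):
        for dx in range(-r_r, r_r + 1):
            y, x = cy + dy, cx + dx
            cand = image[y - r_p:y + r_p + 1, x - r_p:x + r_p + 1].astype(float)
            ssd = np.sum((ref - cand) ** 2)
            # Map SSD to (0, 1]: identical patches give similarity 1.
            sims.append(np.exp(-ssd / (ref.size * 255.0)))
    return np.asarray(sims)
```

Because the descriptor records how the center relates to its own neighborhood, two object instances with different colors or textures but similar internal layout produce similar vectors, which is the intra-class robustness FCSS builds on.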
“…In this paper we propose to use the DSP-SIFT (Domain Size Pooling SIFT) [8] feature to match two images from different camera views. In the construction of a 3D image, the following objects can be chosen as matching units: zero-crossings, edge and line fragments, linear features, object boundaries, and points of interest.…”
Section: Improved Feature Matching In Image Reconstruction (mentioning)
confidence: 99%
“…In DSP-SIFT, pooling occurs across different domain sizes [8]: patches of different sizes are rescaled, and gradient orientations are computed and pooled across locations and scales. The resulting descriptor has the same dimension as ordinary SIFT.…”
Section: Improved Feature Matching In Image Reconstruction (mentioning)
confidence: 99%
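The domain-size pooling step described in the quotation above can be sketched directly: orientation histograms are computed on patches of several sizes around the same point and then averaged, so the pooled descriptor keeps the dimension of a single-size descriptor. This is a simplified single-cell illustration; the function names, bin count, and domain sizes are assumptions, not the DSP-SIFT reference implementation.

```python
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """Histogram of gradient orientations over a patch, weighted by
    gradient magnitude (a simplified single SIFT cell)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    bins = np.floor(ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def dsp_descriptor(image, center, sizes=(8, 12, 16), n_bins=8):
    """Pool orientation histograms over several domain (patch) sizes
    around `center`. Averaging across sizes mirrors DSP-SIFT's
    dimension-preserving pooling: the output length equals n_bins
    regardless of how many sizes are pooled."""
    cy, cx = center
    pooled = np.zeros(n_bins)
    for s in sizes:
        patch = image[cy - s:cy + s, cx - s:cx + s]
        pooled += orientation_histogram(patch, n_bins)
    pooled /= len(sizes)
    n = np.linalg.norm(pooled)
    return pooled / n if n > 0 else pooled
```

Note how the pooled vector has the same length whether one size or ten are used, which is why DSP-SIFT can replace SIFT in an existing matching pipeline without changing descriptor dimensions.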
“…The DSP-SIFT [8] feature outperforms the traditional SIFT feature in many respects. The accuracy and speed of 3D image reconstruction using binocular vision are significantly improved.…”
Abstract. In this paper we propose a novel method of 3D image reconstruction using domain-size pooling SIFT features. First, we introduce the basic principles of binocular vision. Second, we apply the state-of-the-art image feature descriptor DSP-SIFT to the feature matching problem in 3D reconstruction. Third, we test image matching in the 3D reconstruction system. We provide the layout and error measurement of the binocular vision based 3D reconstruction system. Finally, the experimental results are presented. The speed and accuracy are satisfactory.
“…Observed images can then be corrected for deformation with respect to IOP for a higher-accuracy matching process. Domain-size pooling scale-invariant feature transform (DSP-SIFT) (Dong and Soatto, 2015) improves the robustness of point-based descriptors by pooling gradient orientations across different domain sizes. This keeps the descriptor dimension the same while making it more robust to photometric nuisances.…”
ABSTRACT: The primary method for geo-localization is based on GPS, which has issues of localization accuracy, power consumption, and unavailability. This paper proposes a novel approach to geo-localization in a GPS-denied environment for a mobile platform. Our approach has two principal components: public domain transport network data available in GIS databases or OpenStreetMap, and a trajectory of a mobile platform. This trajectory is estimated using visual odometry and 3D view geometry. The transport map information is abstracted as a graph data structure, where various types of roads are modelled as graph edges and, typically, intersections are modelled as graph nodes. A real-time search for the trajectory in the graph yields the geo-location of the mobile platform. Our approach uses a simple visual sensor and has a low memory and computational footprint. In this paper, we demonstrate our method for trajectory estimation and provide examples of geo-localization using public-domain map data. With the rapid proliferation of visual sensors as part of automated driving technology and continuous growth in public domain map data, our approach has the potential to completely augment, or even supplant, GPS-based navigation since it functions in all environments.
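The map abstraction in this abstract (intersections as nodes, roads as edges) can be sketched as a small graph search: an odometry-estimated trajectory, reduced to a sequence of segment lengths, is matched against consecutive edge lengths in the graph. The toy network, function names, and tolerance below are illustrative assumptions, not the paper's algorithm.

```python
from collections import defaultdict

# Hypothetical toy road network: intersections are nodes, roads are
# edges annotated with their length in metres.
EDGES = [("A", "B", 100), ("B", "C", 250), ("B", "D", 80),
         ("C", "E", 250), ("D", "E", 400)]

def build_graph(edges):
    """Undirected adjacency list: node -> [(neighbour, edge length)]."""
    g = defaultdict(list)
    for u, v, length in edges:
        g[u].append((v, length))
        g[v].append((u, length))
    return g

def match_trajectory(graph, segment_lengths, tol=0.1):
    """Return all simple node paths whose consecutive edge lengths
    match the odometry-estimated segment lengths within a relative
    tolerance `tol` (accounts for visual-odometry drift)."""
    def extend(path, remaining):
        if not remaining:
            return [path]
        out = []
        for nxt, length in graph[path[-1]]:
            if nxt not in path and abs(length - remaining[0]) <= tol * remaining[0]:
                out += extend(path + [nxt], remaining[1:])
        return out
    matches = []
    for start in list(graph):
        matches += extend([start], segment_lengths)
    return matches
```

As the trajectory grows, the set of consistent paths shrinks, so the platform's geo-location becomes unambiguous once only one path through the graph remains.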