The goal of this article is twofold. In the first part, we survey a family of nearest neighbor algorithms that are based on the concept of locality-sensitive hashing. Many of these algorithms have already been successfully applied in a variety of practical scenarios. In the second part of this article, we describe a recently discovered hashing-based algorithm for the case where the objects are points in the d-dimensional Euclidean space. As it turns out, the performance of this algorithm is provably near-optimal in the class of the locality-sensitive hashing algorithms.
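To make the locality-sensitive hashing idea concrete, here is a minimal bit-sampling sketch for the Hamming cube (an illustration of the general concept, not any particular scheme from the survey; all names and parameters are of my choosing). A random hash that samples a few coordinates makes nearby points collide far more often than distant ones:

```python
import random

def make_hash(d, k, rng):
    """One LSH function: sample k random coordinates of a d-bit vector."""
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[i] for i in coords)

def collision_rate(h_fns, p, q):
    """Fraction of hash functions under which p and q collide."""
    return sum(h(p) == h(q) for h in h_fns) / len(h_fns)

rng = random.Random(0)
d, k, trials = 100, 5, 2000
fns = [make_hash(d, k, rng) for _ in range(trials)]

base = [0] * d
near = base.copy(); near[:5] = [1] * 5     # Hamming distance 5
far  = base.copy(); far[:40] = [1] * 40    # Hamming distance 40

p_near = collision_rate(fns, base, near)   # expected (1 - 5/100)^5  ~ 0.77
p_far  = collision_rate(fns, base, far)    # expected (1 - 40/100)^5 ~ 0.08
```

The gap between the two collision probabilities is exactly what an LSH-based data structure exploits: hashing the dataset into buckets so that a query mostly collides with its near neighbors.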
We show an optimal data-dependent hashing scheme for the approximate near neighbor problem. For an n-point dataset in a d-dimensional space, our data structure achieves query time O(d · n^(ρ+o(1))) and space O(n^(1+ρ+o(1)) + d · n), where ρ = 1/(2c^2 − 1) for the Euclidean space and approximation c > 1. For the Hamming space, we obtain an exponent of ρ = 1/(2c − 1). Our result completes the direction set forth in [5], which gave a proof of concept that data-dependent hashing can outperform classic Locality-Sensitive Hashing (LSH). In contrast to [5], the new bound is not only optimal, but in fact improves over the best (optimal) LSH data structures [15, 3] for all approximation factors c > 1. From the technical perspective, we proceed by decomposing an arbitrary dataset into several subsets that are, in a certain sense, pseudo-random.
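The exponents above are simple closed forms, so the improvement is easy to check numerically. A small sketch (helper names are mine) contrasting the data-dependent Euclidean exponent 1/(2c^2 − 1) with the classic optimal LSH exponent 1/c^2:

```python
def rho_lsh_euclidean(c):
    """Classic optimal (data-independent) LSH exponent in Euclidean space."""
    return 1.0 / c**2

def rho_data_dependent_euclidean(c):
    """Data-dependent hashing exponent from this work: 1/(2c^2 - 1)."""
    return 1.0 / (2 * c**2 - 1)

def rho_data_dependent_hamming(c):
    """Hamming-space exponent from this work: 1/(2c - 1)."""
    return 1.0 / (2 * c - 1)

# For approximation c = 2 the Euclidean query exponent drops
# from 1/4 (classic LSH) to 1/7 (data-dependent hashing).
classic = rho_lsh_euclidean(2)            # 0.25
new     = rho_data_dependent_euclidean(2) # 1/7 ~ 0.1428
```

For every c > 1 we have 2c^2 − 1 > c^2, so the new exponent is strictly smaller, matching the abstract's claim of an improvement for all approximation factors.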
We give algorithms for geometric graph problems in modern parallel models such as MapReduce. For example, for the Minimum Spanning Tree (MST) problem over a set of points in the two-dimensional space, our algorithm computes a (1 + ε)-approximate MST. Our algorithms work in a constant number of rounds of communication, while using total space and communication proportional to the size of the data (linear space and near-linear time algorithms). In contrast, for general graphs, achieving the same result for MST (or even connectivity) remains a challenging open problem [9], despite drawing significant attention in recent years. We develop a general algorithmic framework that, besides MST, also applies to Earth-Mover Distance (EMD) and the transportation cost problem. Our algorithmic framework has implications beyond the MapReduce model. For example, it yields a new algorithm for computing EMD cost in the plane in near-linear time, n^(1+o(1)). We note that while [33] recently developed a near-linear time algorithm for (1 + ε)-approximating EMD, our algorithm is fundamentally different and, for example, also solves the transportation (cost) problem, raised as an open question in [33]. Furthermore, our algorithm immediately gives a (1 + ε)-approximation algorithm with n^δ space in the streaming-with-sorting model with (1/δ)^O(1) passes. As such, it is tempting to conjecture that the parallel models may also constitute a concrete playground in the quest for efficient algorithms for EMD (and other similar problems) in the vanilla streaming model, a well-known open problem.
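For intuition about the EMD cost being computed (this illustrates the problem, not the paper's plane algorithm): between two equal-size point sets on a line, the minimum-cost perfect matching is obtained by matching points in sorted order, which a brute-force search over matchings confirms on small inputs. A toy sketch with names of my choosing:

```python
from itertools import permutations

def emd_1d(a, b):
    """EMD between equal-size 1D point sets: match in sorted order."""
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b)))

def emd_brute(a, b):
    """Reference: minimum matching cost over all n! assignments."""
    n = len(a)
    return min(sum(abs(a[i] - b[p[i]]) for i in range(n))
               for p in permutations(range(n)))

# Matching 0->1, 1->2, 4->3 costs 1 + 1 + 1 = 3.
cost = emd_1d([0, 1, 4], [1, 2, 3])
```

In the plane, no such simple sorting argument exists; the difficulty of computing this matching cost at scale is exactly what makes the near-linear-time and parallel results nontrivial.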
We show tight upper and lower bounds for time-space trade-offs for the c-approximate Near Neighbor Search problem. For the d-dimensional Euclidean space and n-point datasets, we develop a data structure with space n^(1+ρ_u+o(1)) + O(dn) and query time n^(ρ_q+o(1)) + d·n^o(1) for every ρ_u, ρ_q ≥ 0 with:

c^2 √ρ_q + (c^2 − 1) √ρ_u = √(2c^2 − 1).    (0.1)

In particular, for the approximation c = 2 we get:
• Space n^1.77... and query time n^o(1), significantly improving upon known data structures that support very fast queries [IM98, KOR00];
• Space n^1.14... and query time n^0.14..., matching the optimal data-dependent Locality-Sensitive Hashing (LSH) from [AR15];
• Space n^(1+o(1)) and query time n^0.43..., making significant progress in the regime of near-linear space, which is arguably of the most interest for practice.

This is the first data structure that achieves sublinear query time and near-linear space for every approximation factor c > 1, improving upon [Kap15]. The data structure is a culmination of a long line of work on the problem for all space regimes; it builds on Spherical Locality-Sensitive Filtering [BDGL16] and data-dependent hashing [AINR14, AR15]. Our matching lower bounds are of two types: conditional and unconditional. First, we prove tightness of the whole trade-off (0.1) in a restricted model of computation, which captures all known hashing-based approaches. We then show unconditional cell-probe lower bounds for one and two probes that match (0.1) for ρ_q = 0, improving upon the best known lower bounds from [PTW10]. In particular, this is the first space lower bound (for any static data structure) for two probes which is not polynomially smaller than the one-probe bound.

(This paper merges two arXiv preprints, [Laa15c] (appeared online on November 24, 2015) and [ALRW16] (appeared online on May 9, 2016), and subsumes both. The full version containing all the proofs is available at https://arxiv.org/abs/1608.03580.)
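The three c = 2 regimes quoted above can be checked against the trade-off curve. Assuming the trade-off (0.1) has the form c^2 √ρ_q + (c^2 − 1) √ρ_u = √(2c^2 − 1) (this form is consistent with all three quoted data points), a quick sketch with helper names of my choosing:

```python
from math import sqrt

def rho_u_given_rho_q(c, rho_q):
    """Solve c^2*sqrt(rho_q) + (c^2-1)*sqrt(rho_u) = sqrt(2c^2-1) for rho_u."""
    return ((sqrt(2 * c**2 - 1) - c**2 * sqrt(rho_q)) / (c**2 - 1)) ** 2

def rho_q_given_rho_u(c, rho_u):
    """Solve the same curve for rho_q."""
    return ((sqrt(2 * c**2 - 1) - (c**2 - 1) * sqrt(rho_u)) / c**2) ** 2

c = 2.0
# Very fast queries (rho_q = 0): space exponent 1 + 7/9 = 1.777...
fast_query_space = 1 + rho_u_given_rho_q(c, 0.0)
# Balanced point (rho_q = rho_u): both equal 1/7 = 0.1428...
balanced = rho_u_given_rho_q(c, 1 / 7)
# Near-linear space (rho_u = 0): query exponent 7/16 = 0.4375
near_linear_query = rho_q_given_rho_u(c, 0.0)
```

The endpoints recover the n^1.77 / n^o(1) and n^(1+o(1)) / n^0.43 regimes, and the fixed point of the curve matches the n^1.14 / n^0.14 data-dependent LSH bound.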
To show the result for two probes, we establish and exploit a connection to locally-decodable codes.
We present a new data structure for the -approximate near neighbor problem (ANN) in the Euclidean space. For points in R , our algorithm achieves ( + log ) query time and ( 1+
We investigate the optimality of (1 + ε)-approximation algorithms obtained via the dimensionality reduction method. We show that: • Any data structure for the (1 + ε)-approximate nearest neighbor problem in Hamming space, which uses a constant number of probes to answer each query, must use n
A technique introduced by Indyk and Woodruff (STOC 2005) has inspired several recent advances in data-stream algorithms. We show that a number of these results follow easily from the application of a single probabilistic method called Precision Sampling. Using this method, we obtain simple data-stream algorithms that maintain a randomized sketch of an input vector x = (x1, x2, . . . , xn), which is useful for the following applications:
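The core idea behind Precision Sampling, tolerating per-coordinate estimates whose accuracy is allowed to degrade with a random threshold, can be illustrated with a toy sketch. This is a deliberate simplification, not the lemma as stated in the paper, and all names are mine: to estimate a sum of values in [0, 1], it suffices to know each value only up to an additive error proportional to its own random threshold u_i.

```python
import random

def precision_sampling_estimate(a, rng, noise=0.05):
    """Toy estimator for sum(a), with each a_i in [0, 1].

    Each coordinate is observed only through a crude estimate a_hat
    whose additive error scales with a random threshold u ~ U(0, 1);
    coordinates that draw a large u may be estimated very cheaply.
    Counting coordinates with a_hat >= u estimates sum(a), since
    P[u <= a_i] = a_i for an exact estimate.
    """
    count = 0
    for ai in a:
        u = rng.random()
        # crude estimate: accurate only to within +/- noise * u
        a_hat = ai + (rng.random() - 0.5) * 2 * noise * u
        if a_hat >= u:
            count += 1
    return count

rng = random.Random(42)
a = [0.3] * 100_000            # true sum = 30000.0
est = precision_sampling_estimate(a, rng)
```

The point of the design is that the accuracy demanded of each coordinate is random and usually lax, which is what makes the method cheap to implement on top of streaming sketches.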