We present a randomized algorithm for the approximate nearest neighbor problem in d-dimensional Euclidean space. Given N points {x_j} in R^d, the algorithm attempts to find k nearest neighbors for each of the x_j, where k is a user-specified integer parameter. The algorithm is iterative, and its running time requirements are proportional to T·N·(d·(log d) + k·(d + log k)·(log N)) + N·k²·(d + log k), with T the number of iterations performed. The memory requirements of the procedure are of the order N·(d + k). A by-product of the scheme is a data structure permitting a rapid search for the k nearest neighbors among {x_j} for an arbitrary point x ∈ R^d. The cost of each such query is proportional to T·(d·(log d) + log(N/k)·k·(d + log k)), and the memory requirements for the requisite data structure are of the order N·(d + k) + T·(d + N). The algorithm utilizes random rotations and a basic divide-and-conquer scheme, followed by a local graph search. We analyze the scheme's behavior for certain types of distributions of {x_j} and illustrate its performance via several numerical examples.

data mining | dimensionality reduction | fast random rotations

In this paper, we describe an algorithm for finding approximate nearest neighbors (ANNs) in d-dimensional Euclidean space for each of N user-specified points {x_j}. For each point x_j, the scheme produces a list of k "suspects" that have a high probability of being the k closest points (nearest neighbors) in the Euclidean metric. Those of the suspects that are not among the "true" nearest neighbors are close to being so. We present several measures of performance (in terms of statistics of the k chosen suspected nearest neighbors) for different types of randomly generated datasets consisting of N points in R^d. Unlike other ANN algorithms that have been recently proposed (see, e.g., ref. 1), the method of this paper does not use locality-sensitive hashing. Instead, we use a simple randomized divide-and-conquer approach. The basic algorithm is iterated several times and then followed by a local graph search.

The performance of any fast ANN algorithm must deteriorate as the dimension d increases. Although the running time of our algorithm grows only as d·log d, the statistics of the selected approximate nearest neighbors deteriorate as the dimension d increases. We provide bounds for this deterioration (both analytically and empirically), which occurs reasonably slowly as d increases. Although the actual estimates are fairly complicated, it is reasonable to say that in 20 dimensions the scheme performs extremely well, and the performance does not seriously deteriorate until d is approximately 60. At d = 100, the degradation of the statistics displayed by the algorithm is quite noticeable.

An outline of our algorithm is as follows (a brief sketch of the first two steps appears after the outline):

1. Choose a random rotation acting on R^d and rotate the N given points.
2. Take the first coordinate and divide the dataset into two boxes, where the boxes are divided by finding the median in the first coordinate...
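To make steps 1 and 2 concrete, the following is a minimal Python sketch, not the paper's implementation: it draws a random orthogonal matrix via a QR factorization, rotates the points, and splits them into two boxes at the median of the first coordinate. The function names are illustrative, and the dense d × d rotation used here costs d² per point; the fast random rotations referred to in the abstract achieve the d·(log d) cost.

```python
# Illustrative sketch of steps 1-2: random rotation, then a median split on
# the first coordinate. Uses a dense random orthogonal matrix for simplicity;
# the algorithm described above uses fast random rotations with d*log(d) cost.

import numpy as np

def random_rotation(d, rng):
    """Return a d x d random orthogonal matrix."""
    gaussian = rng.standard_normal((d, d))
    q, r = np.linalg.qr(gaussian)
    # Fix column signs so the matrix is uniformly distributed over the
    # orthogonal group.
    return q * np.sign(np.diag(r))

def rotate_and_split(points, rng):
    """Rotate the N x d array `points`, then split the rotated points into
    two boxes at the median of the first coordinate."""
    n, d = points.shape
    rotated = points @ random_rotation(d, rng).T
    order = np.argsort(rotated[:, 0])
    half = n // 2
    left_box, right_box = order[:half], order[half:]  # indices into `points`
    return rotated, left_box, right_box

# Example usage on random data: N = 1000 points in d = 20 dimensions.
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 20))
rotated, left, right = rotate_and_split(x, rng)
```

In the full scheme this split is applied recursively to each box, and the whole procedure is repeated for T independent random rotations before the local graph search.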