2013
DOI: 10.1002/sam.11207

A distributed kernel summation framework for general‐dimension machine learning

Abstract: Kernel summations are a ubiquitous key computational bottleneck in many data analysis methods. In this paper, we attempt to marry, for the first time, the best relevant techniques in parallel computing, where kernel summations are in low dimensions, with the best general‐dimension algorithms from the machine learning literature. We provide the first distributed implementation of a kernel summation framework that can utilize: (i) various types of deterministic and probabilistic approximations that may be suitable…
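For context, a kernel summation evaluates at each query point a sum of kernel values over all reference points; computed naively, this costs O(NM) kernel evaluations. A minimal sketch of that brute-force baseline follows (the Gaussian kernel, NumPy, and all names are illustrative choices, not the paper's code):

```python
import numpy as np

def naive_kernel_sum(references, queries, h=1.0):
    """Brute-force Gaussian kernel summation:
    f(q) = sum_i exp(-||x_i - q||^2 / (2 h^2)) for every query q.
    This O(N * M) computation is the bottleneck that tree-based and
    distributed frameworks aim to reduce."""
    # Pairwise squared distances between M queries and N references.
    d2 = ((queries[:, None, :] - references[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * h * h)).sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # N = 1000 reference points in 3-D
Q = rng.normal(size=(5, 3))      # M = 5 query points
print(naive_kernel_sum(X, Q, h=0.5))
```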

Cited by 14 publications (10 citation statements)
References 27 publications
“…There are still interesting future directions to pursue, though. The first direction is parallelism: because our dual-tree algorithm is agnostic to the type of traversal used, we may use a parallel traversal (Curtin et al, 2013b), such as an adapted version of a recent parallel dual-tree algorithm (Lee et al, 2012). The second direction is kernel k-means and other spectral clustering techniques: our algorithm may be merged with the ideas of Curtin & Ram (2014) to perform kernel k-means.…”
Section: Discussion (mentioning)
confidence: 99%
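The "agnostic to the type of traversal" remark refers to designs in which a dual-tree algorithm is specified only by a per-point-pair BaseCase and a node-pair Score (prune) rule, so the traversal driving them can be swapped, e.g. for a parallel one. Below is a hedged sketch of that separation, loosely following the tree-independent dual-tree framework of Curtin et al.; the Node class and toy usage are illustrative:

```python
import math

class Node:
    """Minimal tree node: leaves hold points; internal nodes hold children."""
    def __init__(self, points=(), children=()):
        self.points, self.children = list(points), list(children)
    def is_leaf(self):
        return not self.children

def dual_tree_traversal(qnode, rnode, score, base_case):
    """Generic depth-first dual-tree traversal. The algorithm lives entirely
    in `score` (return math.inf to prune a node pair) and `base_case` (the
    per-point-pair work); a parallel traversal could drive the same rules."""
    if score(qnode, rnode) == math.inf:
        return  # the Score rule proved this node pair cannot contribute
    if qnode.is_leaf() and rnode.is_leaf():
        for q in qnode.points:
            for r in rnode.points:
                base_case(q, r)
        return
    # Descend whichever side(s) still have children.
    for qc in (qnode.children if not qnode.is_leaf() else [qnode]):
        for rc in (rnode.children if not rnode.is_leaf() else [rnode]):
            dual_tree_traversal(qc, rc, score, base_case)

# Toy usage: a Score that never prunes visits all 3 x 3 point pairs.
root = Node(children=[Node(points=[0.0, 1.0]), Node(points=[2.0])])
pairs = []
dual_tree_traversal(root, root, lambda a, b: 0.0,
                    lambda q, r: pairs.append((q, r)))
print(len(pairs))  # 9
```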
“…The dual-tree method is based on space-partitioning trees for both the input sample and the evaluation points. These tree structures are then used to compute distances between input points and evaluation points more quickly; see Gray and Moore (2001), Gray and Moore (2003), Lang et al (2005), Lee et al (2006), Ram et al (2009), Curtin et al (2013), Griebel and Wissel (2013), Lee et al (2014). Among all these methods, fast sum updating is the only one that is exact (no extra approximation is introduced) and whose speed is independent of the input data, the kernel, and the bandwidth.…”
Section: Introduction (mentioning)
confidence: 99%
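To make the pruning idea behind these tree-based methods concrete, here is a single-tree simplification (one space-partitioning tree over the references only; a dual tree would partition the queries as well). This is a hedged sketch under assumed names and a Gaussian kernel, not any of the cited implementations:

```python
import numpy as np

class RefNode:
    """Binary space-partitioning node over reference points, with a
    bounding ball (center, radius) used to bound kernel values."""
    def __init__(self, pts, leaf_size=32):
        self.pts = pts
        self.center = pts.mean(axis=0)
        self.radius = float(np.sqrt(((pts - self.center) ** 2).sum(axis=1).max()))
        self.children = []
        if len(pts) > leaf_size:
            dim = int(pts.var(axis=0).argmax())   # split the widest dimension
            order = pts[:, dim].argsort()
            mid = len(pts) // 2
            self.children = [RefNode(pts[order[:mid]], leaf_size),
                             RefNode(pts[order[mid:]], leaf_size)]

def tree_kernel_sum(node, q, h, eps):
    """Approximate sum_i exp(-||x_i - q||^2 / (2 h^2)). If the kernel's upper
    and lower bounds over a node differ by less than eps, prune: charge every
    point in the node the kernel value at the node center (per-point error < eps)."""
    d = float(np.sqrt(((q - node.center) ** 2).sum()))
    dmin, dmax = max(d - node.radius, 0.0), d + node.radius
    if np.exp(-dmin * dmin / (2 * h * h)) - np.exp(-dmax * dmax / (2 * h * h)) < eps:
        return len(node.pts) * np.exp(-d * d / (2 * h * h))   # pruned node
    if not node.children:                                     # leaf: exact sum
        return np.exp(-((node.pts - q) ** 2).sum(axis=1) / (2 * h * h)).sum()
    return sum(tree_kernel_sum(c, q, h, eps) for c in node.children)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
q = np.array([0.2, -0.1])
exact = np.exp(-((X - q) ** 2).sum(axis=1) / (2 * 0.25 ** 2)).sum()
print(exact, tree_kernel_sum(RefNode(X), q, h=0.25, eps=1e-4))
```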
“…In the original FMM, the kernel function is approximated by analytical tools (either addition theorems of special functions or Taylor expansions) [12,4,6,10,5]. To overcome the difficulties when an analytic formulation of the kernel function is not available, various semi-analytic [1,11,18] and algebraic FMMs [21,22,23] were developed in recent decades. In some other approaches [16,17], the whole kernel matrix is split into block matrices of various ranks; on each block an SVD is computed and a truncated summation is used.…”
Section: Introduction (mentioning)
confidence: 99%
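The blockwise-SVD idea mentioned last can be illustrated briefly: for a smooth kernel, the sub-matrix coupling two well-separated point clusters is numerically low-rank, so a rank-r truncated SVD turns the block's O(mn) matrix-vector product into roughly O((m + n)r) work after a one-time factorization. A hedged sketch, where the Gaussian kernel, cluster geometry, and rank are assumptions:

```python
import numpy as np

def truncated_block_sum(K_block, weights, rank):
    """Apply one kernel sub-matrix to a weight vector through its rank-r
    truncated SVD: K_block @ w  ~=  U_r @ (s_r * (Vt_r @ w))."""
    U, s, Vt = np.linalg.svd(K_block, full_matrices=False)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]
    return U @ (s * (Vt @ weights))

# Example: a Gaussian-kernel block between two well-separated clusters is
# nearly low-rank, so a small rank already gives high accuracy.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))            # target cluster near the origin
Y = rng.normal(size=(300, 2)) + 10.0     # source cluster far away
K = np.exp(-((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1) / (2 * 8.0 ** 2))
w = rng.normal(size=300)
print(np.abs(K @ w - truncated_block_sum(K, w, rank=5)).max())  # small error
```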