k-means is a widely used clustering algorithm, but for k clusters and a dataset of size N, each iteration of Lloyd's algorithm costs O(kN) time. This is problematic because applications of k-means increasingly involve both large N and large k, and there are no accelerated variants that handle this situation. To this end, we propose a dual-tree algorithm that gives exactly the same results as standard k-means; when using cover trees, we bound the single-iteration runtime of the algorithm as O(N + k log k), under some assumptions. To our knowledge, these are the first sub-O(kN) bounds for exact Lloyd iterations. The algorithm performs competitively in practice, especially for large N and k in low dimensions. Further, the algorithm is tree-independent, so any type of tree may be used.
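To make the O(kN) per-iteration cost concrete, the following is a minimal sketch of one naive Lloyd iteration in plain Python. It is the baseline that the abstract refers to, not the proposed dual-tree algorithm; the function name `lloyd_iteration` and the choice to leave an empty cluster's centroid unchanged are assumptions made for illustration.

```python
def lloyd_iteration(points, centroids):
    """One exact (naive) Lloyd iteration.

    Assigns every point to its nearest centroid, then recomputes each
    centroid as the mean of its assigned points. The nested loop performs
    k * N squared-distance evaluations, which is the O(kN) cost.
    """
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    assignments = []

    for p in points:                          # N points
        best, best_d = 0, float("inf")
        for j, c in enumerate(centroids):     # k centroids per point
            d = sum((a - b) ** 2 for a, b in zip(p, c))
            if d < best_d:
                best, best_d = j, d
        assignments.append(best)
        counts[best] += 1
        for t in range(dim):
            sums[best][t] += p[t]

    # Assumption: an empty cluster keeps its previous centroid.
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return assignments, new_centroids
```

Any accelerated exact variant must return the same assignments and centroids as this loop while avoiding most of the inner distance evaluations.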
GPUSVM (Graphics Processing Unit Support Vector Machine) is a Compute Unified Device Architecture (CUDA)-based Support Vector Machine (SVM) package. It is designed to offer end users a fully functional and user-friendly SVM tool that utilizes the power of GPUs. The core package includes an efficient cross-validation tool, a fast training tool, and a prediction tool. In this article, we first introduce the background theory behind our parallel SVM solver, which is built on the CUDA programming model. We then compare our GPUSVM package with the popular state-of-the-art Libsvm package on several well-known datasets. The preliminary results show a speed improvement of one to two orders of magnitude over Libsvm in both the training and prediction phases on our Tesla server.
This paper presents an approach for comparing various feature ranking (FR) methods. First, six classification benchmarks are created using Exhaustive Search (ES) to select the best feature subsets. The subset selection is performed within double (nested) cross-validation procedures, which guarantees realistic accuracy estimates on unseen examples. Next, seven filter FR approaches are compared and ranked with respect to the top five feature subsets for each data set. This paper also introduces a method for quantifying and comparing FR results. The results suggest that using the Gini index or scatter ratios leads to rankings that are, on average, closest to those of ES.
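As an illustration of what a filter FR method based on the Gini index can look like, the sketch below scores each feature by the largest decrease in Gini impurity achievable with a single threshold split, then sorts the features by that score. This is one common formulation and is an assumption here; the abstract does not specify the exact definition used in the paper, and the names `gini`, `best_gini_gain`, and `rank_features` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_c p_c^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_gini_gain(values, labels):
    """Largest impurity decrease over all single-threshold splits of one feature."""
    n = len(labels)
    parent = gini(labels)
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best


def rank_features(X, y):
    """Return feature indices sorted from most to least relevant, plus the scores."""
    n_features = len(X[0])
    scores = [best_gini_gain([row[f] for row in X], y) for f in range(n_features)]
    order = sorted(range(n_features), key=lambda f: scores[f], reverse=True)
    return order, scores
```

Because a filter method like this scores each feature independently of any classifier, it is far cheaper than ES, which is why comparing its ranking against the ES benchmark subsets is informative.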