mlpack 3: a fast, flexible machine learning library

Curtin, Ryan R.; Edel, Marcus; Lozhnikov, Mikhail; Mentekidis, Yannis; Ghaisas, Sumedh; Zhang, Shangtong

doi:10.21105/joss.00726

Cited by 119 publications

(116 citation statements)

References 6 publications

Supporting

Mentioning

115

Contrasting


“…We use the publicly available mlpack kmeans program in mlpack [12]; we run it as $ mlpack_kmeans -i dataset.csv -I centroids.csv -c $k -v -e -a $algorithm where $k is the number of clusters and $algorithm is the algorithm to be used. [28,35,30].…”

Section: Methods (mentioning)

confidence: 99%

A Dual-Tree Algorithm for Fast k-means Clustering With Large k

Curtin

2017

Proceedings of the 2017 SIAM International Conference on Data Mining


k-means is a widely used clustering algorithm, but for k clusters and a dataset size of N, each iteration of Lloyd's algorithm costs O(kN) time. This is problematic because increasingly, applications of k-means involve both large N and large k, and there are no accelerated variants that handle this situation. To this end, we propose a dual-tree algorithm that gives the exact same results as standard k-means; when using cover trees, we bound the single-iteration runtime of the algorithm as O(N + k log k), under some assumptions. To our knowledge these are the first sub-O(kN) bounds for exact Lloyd iterations. The algorithm performs competitively in practice, especially for large N and k in low dimensions. Further, the algorithm is tree-independent, so any type of tree may be used.
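To make the O(kN) cost in the abstract above concrete, here is a minimal sketch of one naive Lloyd iteration in plain Python. It is not the dual-tree algorithm from the paper and not mlpack code; the function name `lloyd_iteration` and the returned distance counter are illustrative assumptions. Every point is compared against every centroid, so the counter always equals k times N.

```python
def lloyd_iteration(points, centroids):
    """One naive Lloyd iteration (hypothetical helper, not mlpack's API).

    points:    list of equal-length coordinate tuples (N points)
    centroids: list of coordinate tuples (k centroids)
    Returns the updated centroids and the number of distance evaluations.
    """
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    distance_evaluations = 0

    # Assignment step: each of the N points is compared to all k centroids.
    for p in points:
        best, best_dist = 0, float("inf")
        for j, c in enumerate(centroids):
            dist = sum((a - b) ** 2 for a, b in zip(p, c))
            distance_evaluations += 1
            if dist < best_dist:
                best, best_dist = j, dist
        counts[best] += 1
        for t in range(dim):
            sums[best][t] += p[t]

    # Update step: each centroid moves to the mean of its assigned points.
    # An empty cluster keeps its previous centroid.
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return new_centroids, distance_evaluations


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    new_c, evals = lloyd_iteration(pts, [(1.0, 1.0), (9.0, 1.0)])
    print(new_c, evals)
```

Accelerated variants such as the one described in the abstract aim to return the same assignments while skipping most of these distance evaluations.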

“…Besides exploiting the structure of the input data and the learning task, the problem of learning models over databases can also benefit tremendously from database system techniques. Recent work [50] showed non-trivial speedups (several orders of magnitude) brought by code optimization for machine learning workloads over state-of-the-art systems such as TensorFlow [1], R [46], Scikit-learn [44], and mlpack [13]. Prime examples of code optimizations leading to such performance improvements include:…”

Section: Database Systems Considerations (mentioning)

confidence: 99%

Learning Models over Relational Data: A Brief Tutorial

Schleich

Olteanu

Abo-Khamis

et al. 2019

Lecture Notes in Computer Science


This tutorial overviews the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research. The input to learning classification and regression models is a training dataset defined by feature extraction queries over relational databases. The mainstream approach to learning over relational data is to materialize the training dataset, export it out of the database, and then learn over it using a statistical package. This approach can be expensive as it requires the materialization of the training dataset. An alternative approach is to cast the machine learning problem as a database problem by transforming the data-intensive component of the learning task into a batch of aggregates over the feature extraction query and by computing this batch directly over the input database. The tutorial highlights a variety of techniques developed by the database theory and systems communities to improve the performance of the learning task. They rely on structural properties of the relational data and of the feature extraction query, including algebraic (semi-ring), combinatorial (hypertree width), statistical (sampling), or geometric (distance) structure. They also rely on factorized computation, code specialization, query compilation, and parallelization.
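As a rough illustration of the "batch of aggregates" idea in the abstract above, the sketch below fits a one-feature least-squares line over the join of two toy relations without materializing the join. It is an assumption-laden toy, not the tutorial's system: the relations `R(key, x)` and `S(key, y)`, and the helper names `join_aggregates` and `fit_line`, are hypothetical. Per-key partial sums are computed on each relation separately and then combined, which yields the same aggregates as summing over the joined rows.

```python
from collections import defaultdict


def join_aggregates(R, S):
    """Aggregates over R JOIN S on key, computed without building the join.

    R: iterable of (key, x) rows; S: iterable of (key, y) rows (toy schema).
    Returns (count, sum_x, sum_y, sum_xx, sum_xy) over the join result.
    """
    r_stats = defaultdict(lambda: [0, 0.0, 0.0])  # per key: count, sum x, sum x^2
    for key, x in R:
        stats = r_stats[key]
        stats[0] += 1
        stats[1] += x
        stats[2] += x * x

    s_stats = defaultdict(lambda: [0, 0.0])  # per key: count, sum y
    for key, y in S:
        stats = s_stats[key]
        stats[0] += 1
        stats[1] += y

    n = sx = sy = sxx = sxy = 0.0
    for key, (cnt_r, sum_x, sum_xx) in r_stats.items():
        if key not in s_stats:
            continue  # key has no join partner
        cnt_s, sum_y = s_stats[key]
        # Each R row with this key pairs with every S row with this key.
        n += cnt_r * cnt_s
        sx += sum_x * cnt_s
        sy += cnt_r * sum_y
        sxx += sum_xx * cnt_s
        sxy += sum_x * sum_y
    return n, sx, sy, sxx, sxy


def fit_line(n, sx, sy, sxx, sxy):
    """Least-squares intercept and slope from the aggregates alone."""
    denom = n * sxx - sx * sx
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


if __name__ == "__main__":
    R = [(1, 1.0), (1, 2.0), (2, 3.0)]
    S = [(1, 5.0), (2, 7.0), (2, 9.0), (3, 1.0)]
    print(fit_line(*join_aggregates(R, S)))
```

The learning step only ever sees five numbers, regardless of how large the join result would be.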


“…Although this step serves only for coarsening the data representation, it dominates the computation cost of the first four steps. Two state-of-the-art K-means++ implementations were tested, K-MeansRex [26], and scalable mlpack package [27]. For a test run with the data points in the range of 1 to 100K (d = 2, n_c = 100), K-means++ from mlpack was 1.82 times faster on average in execution than K-MeansRex's implementation.…”

Section: Clustering Step (mentioning)

confidence: 99%

A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training

Yadav

Bode

2019

J Big Data


A scalable graphical method is presented for selecting, and partitioning datasets for the training phase of a classification task. For the heuristic, a clustering algorithm is required to get its computation cost in a reasonable proportion to the task itself. This step is proceeded by construction of an information graph of the underlying classification patterns using approximate nearest neighbor methods. The presented method constitutes of two approaches, one for reducing a given training set, and another for partitioning the selected/reduced set. The heuristic targets large datasets, since the primary goal is significant reduction in training computation run-time without compromising prediction accuracy. Test results show that both approaches significantly speed-up the training task when compared against that of state-of-the-art shrinking heuristic available in LIBSVM. Furthermore, the approaches closely follow or even outperform in prediction accuracy. A network design is also presented for the partitioning based distributed training formulation. Added speed-up in training run-time is observed when compared to that of serial implementation of the approaches.
