Today's microprocessors consist of multiple cores, each of which can perform multiple additions, multiplications, or other operations simultaneously in one clock cycle. To maximize performance, two types of parallelism must be applied in a data mining algorithm: MIMD (Multiple Instruction Multiple Data), where different CPU cores execute different code and follow different threads of control, and SIMD (Single Instruction Multiple Data), where within a core the same operation is executed simultaneously on multiple data elements. It is commonly agreed among data mining practitioners and researchers that disproportionately few works consider the performance potential of today's popular microarchitectures. In this paper, we consider the widespread clustering algorithm K-means as a highly relevant use case for knowledge discovery on big data. We propose Multi-core K-Means (MKM), a completely re-engineered clustering algorithm which applies both MIMD and SIMD parallelism. MKM uses a sophisticated strategy for accessing data vectors and cluster representatives to minimize data transfer between main memory, cache, and registers. For SIMD parallelism it is also essential to avoid branching operations like if-then: we propose to encode cluster IDs and distances in joint variables so that the argmin operation can be performed SIMD-parallel and without any branching. Our experiments demonstrate a speed-up which is almost linear in the number of cores. On a pair of shared-memory quad-core processors, MKM is between 95 and 140 times faster than non-parallel K-means, 4-6 times faster than auto-vectorized, fully parallel standard K-means, and 2.1 times faster than K-means based on BLAS.
Modern microprocessors offer a rich memory hierarchy comprising several levels of cache and registers. Some of these memories (like main memory and the L3 cache) are large but slow and shared among all cores. Others (registers, L1 cache) are fast and exclusively assigned to a single core, but small. Only if data accesses have high locality can we avoid excessive data transfers across the memory hierarchy. In this paper we consider fundamental algorithms like matrix multiplication, K-Means, Cholesky decomposition, and the algorithm by Floyd and Warshall, which typically operate in two or three nested loops. We propose to traverse these loops, whenever possible, not in the canonical order but in an order defined by a space-filling curve. This traversal order dramatically improves data locality across a wide range of granularities, allowing us not only to efficiently support a single cache of known size (cache-conscious) but also a hierarchy of caches whose effective sizes may even be unknown to our algorithms (cache-oblivious). We propose a new space-filling curve called Fast Unrestricted (FUR) Hilbert with the following advantages: (1) we overcome the usual limitation to square-like grid sizes whose side length is a power of 2 or 3; instead, our approach allows arbitrary loop boundaries for all variables. (2) FUR-Hilbert is non-recursive, with a guaranteed constant worst-case time complexity per loop iteration (in contrast to O(log(gridsize)) for previous methods). (3) Our non-recursive approach makes the application of our cache-oblivious loops in any host algorithm as easy as conventional loops and facilitates automatic optimization by the compiler. (4) We demonstrate that crucial algorithms like Cholesky decomposition as well as the algorithm by Floyd and Warshall can be efficiently supported. (5) Extensive experiments on runtime efficiency, cache usage, and energy consumption demonstrate the benefit of our approach.
We believe that future compilers could translate nested loops into cache-oblivious loops either fully automatically or through a user-guided analysis of the data dependencies.