Today's microprocessors consist of multiple cores, each of which can perform multiple additions, multiplications, or other operations simultaneously in one clock cycle. To maximize performance, a data mining algorithm must exploit two types of parallelism: MIMD (Multiple Instruction Multiple Data), where different CPU cores execute different code and follow different threads of control, and SIMD (Single Instruction Multiple Data), where a single core executes the same operation on several data items at once. It is commonly agreed among data mining practitioners and researchers that disproportionately few works consider the performance potential of today's popular micro-architectures. In this paper, we consider the widespread clustering algorithm K-means as a highly relevant use case for knowledge discovery on big data. We propose Multi-core K-Means (MKM), a completely re-engineered clustering algorithm that applies both MIMD and SIMD parallelism. MKM uses a sophisticated strategy for accessing data vectors and cluster representatives that minimizes data transfer between main memory, cache, and registers. For SIMD parallelism it is also essential to avoid branching operations such as if-then: we propose to encode cluster IDs and distances in joint variables so that the argmin operation can be performed SIMD-parallel and without any branching. Our experiments demonstrate a speed-up that is almost linear in the number of cores. On a pair of shared-memory quad-core processors, MKM is between 95 and 140 times faster than non-parallel K-means, 4 to 6 times faster than auto-vectorized, fully parallel standard K-means, and 2.1 times faster than K-means based on BLAS.
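The idea of coding cluster IDs and distances in joint variables can be illustrated with a minimal sketch. The abstract does not specify the exact encoding used by MKM, so the layout below is an assumption: the lowest mantissa bits of each non-negative squared distance are overwritten with the cluster ID, which relies on the fact that the IEEE 754 bit patterns of non-negative doubles sort in the same order as their values. The names `ID_BITS`, `pack`, `unpack_id`, and `nearest_cluster` are hypothetical, and plain Python is used only to show the encoding; in a real SIMD kernel the final `min` would presumably map to a vector minimum instruction rather than a loop with comparisons.

```python
import struct

# Assumption: at most 2**ID_BITS clusters; that many low mantissa bits are sacrificed.
ID_BITS = 8
ID_MASK = (1 << ID_BITS) - 1


def float_bits(x: float) -> int:
    """Return the IEEE 754 binary64 bit pattern of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def pack(distance: float, cluster_id: int) -> int:
    """Replace the lowest ID_BITS mantissa bits of a non-negative distance with the ID."""
    return (float_bits(distance) & ~ID_MASK) | cluster_id


def unpack_id(packed: int) -> int:
    """Recover the cluster ID stored in the low bits of a packed value."""
    return packed & ID_MASK


def squared_distance(x, c):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def nearest_cluster(x, centers):
    """Argmin over clusters expressed as a single minimum over packed values."""
    packed = [pack(squared_distance(x, c), k) for k, c in enumerate(centers)]
    # The minimum carries its own cluster ID, so no "if d < best" bookkeeping is needed.
    return unpack_id(min(packed))


centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
print(nearest_cluster((4.0, 4.5), centers))  # 1
```

Because the ID occupies the least significant bits, it only influences the comparison when two distances agree in all higher bits; in that case the cluster with the smaller ID wins. The price is a small loss of precision in the distance (here the lowest 8 of 52 mantissa bits), which this sketch assumes is acceptable for cluster assignment.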