Big data analysis requires the presence of large computing powers, which is not always feasible. And so, it became necessary to develop new clustering algorithms capable of such data processing. This study proposes a new parallel clustering algorithm based on the k‐means algorithm. It significantly reduces the exponential growth of computations. The proposed algorithm splits a dataset into batches while preserving the characteristics of the initial dataset and increasing the clustering speed. The idea is to define cluster centroids, which are also clustered, for each batch. According to the obtained centroids, the data points belong to the cluster with the nearest centroid. Real large datasets are used to conduct the experiments to evaluate the effectiveness of the proposed approach. The proposed approach is compared with k‐means and its modification. The experiments show that the proposed algorithm is a promising tool for clustering large datasets in comparison with the k‐means algorithm.
Selection of the right tool for anomaly (outlier) detection in Big data is an urgent task. In this paper algorithms for data clustering and outlier detection that take into account the compactness and separation of clusters are provided. We consider the features of their use in this capacity. Numerical experiments on real data of different sizes demonstrate the effectiveness of the proposed algorithms.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.