Hierarchical density-based clustering is a powerful tool for exploratory data analysis, which can play an important role in the understanding and organization of datasets. However, its applicability to large datasets is limited because the computational complexity of hierarchical clustering methods has a quadratic lower bound in the number of objects to be clustered. MapReduce is a popular programming model for speeding up data mining and machine learning algorithms that operate on large, possibly distributed datasets. In the literature, there have been attempts to parallelize algorithms such as Single-Linkage, which in principle can also be extended to the broader scope of hierarchical density-based clustering, but hierarchical clustering algorithms are inherently difficult to parallelize with MapReduce. In this paper, we discuss why adapting previous approaches to parallelizing Single-Linkage clustering with MapReduce leads to very inefficient solutions when one wants to compute density-based clustering hierarchies. We first discuss one such solution, which is based on an exact, yet very computationally demanding, random blocks parallelization scheme. To apply hierarchical density-based clustering efficiently to large datasets using MapReduce, we then propose a different parallelization scheme that computes an approximate clustering hierarchy based on a much faster, recursive sampling approach. This approach is based on HDBSCAN*, the state-of-the-art hierarchical density-based clustering algorithm, combined with a data summarization technique called data bubbles. The proposed method is evaluated in terms of both runtime and quality of the approximation on a number of datasets, showing its effectiveness and scalability.
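To illustrate where the quadratic cost mentioned above comes from, the following is a minimal, sequential sketch of the core of HDBSCAN*: a minimum spanning tree over mutual reachability distances. It is not the MapReduce scheme proposed in the paper, and it omits data bubbles and the extraction of the hierarchy from the tree. The function names, the `min_pts` parameter name, and the use of Euclidean distance are assumptions made for illustration only.

```python
import math


def core_distances(points, min_pts):
    """Distance from each point to its min_pts-th nearest neighbor.

    Assumption: the point itself counts as its own first neighbor.
    """
    cores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)  # O(n log n) per point
        cores.append(dists[min_pts - 1])
    return cores


def mutual_reachability(points, cores, i, j):
    """Largest of the two core distances and the direct distance."""
    return max(cores[i], cores[j], math.dist(points[i], points[j]))


def mutual_reachability_mst(points, min_pts):
    """Prim's algorithm on the complete mutual reachability graph.

    Every pair of points is examined, which is the source of the
    quadratic behavior in the number of objects.
    """
    n = len(points)
    cores = core_distances(points, min_pts)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex not yet in the tree.
        u = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Relax edges from u to every remaining vertex.
        for v in range(n):
            if not in_tree[v]:
                w = mutual_reachability(points, cores, u, v)
                if w < best[v]:
                    best[v] = w
                    parent[v] = u
    return edges


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
    for a, b, w in mutual_reachability_mst(sample, min_pts=2):
        print(a, b, round(w, 3))
```

In this sketch, an isolated point such as `(10.0, 10.0)` is attached to the tree by a heavy edge, which is the property a density-based hierarchy relies on to separate sparse regions from dense ones.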
I thank my advisor, Professor Dr. Ricardo, for the opportunity and for his readiness to help throughout the development of this research. I would also like to thank my collaborators, Professor Dr. Murilo Coelho Naldi and Professor Dr. Jörg Sander, for their constant support and patience during the development of this work. I would like to thank my mother Ivani, my father Jovino, and my sister Luciana for always supporting my decisions and for helping me when I needed it most, before and during the master's program. I also thank my girlfriend Lohany for her affection, patience, and understanding at important moments of my academic journey. I would like to thank all the friends and colleagues who were part of my academic life during these nearly three years of research. In particular, I would like to thank my friends Misael, Jonathan, Francisco, Filomen, Evinton, Lucas, and Weslei for the moments of relaxation, the exchange of knowledge, the conversations, the laughter, and sometimes the tears. I am grateful for the financial support of CAPES and for the technological support provided by FAPEMIG during my field research at the Universidade Federal de Viçosa, Rio Paranaíba campus, MG. I also thank the Instituto de Ciências Matemáticas e de Computação (ICMC) for the opportunity and the space to learn and to do science.