Abstract: Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k pa…
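The sequential nature the abstract refers to can be seen in a minimal sketch of standard k-means++ seeding (D² sampling). This illustrates the well-known sequential procedure, not the parallel method the paper proposes. The function names `sq_dist` and `kmeanspp_seed` are hypothetical, and points are assumed to be tuples of floats.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers by D^2 sampling (illustrative sketch)."""
    rng = rng or random.Random()
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Sample the next center with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(points[idx])
        # One full pass over the data per new center: this is the step
        # that makes the procedure sequential.
        d2 = [min(d, sq_dist(p, centers[-1])) for d, p in zip(d2, points)]
    return centers
```

Each new center depends on the distances to all previously chosen centers, so the loop cannot pick its k centers independently; every iteration needs another pass over the data.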
“…Chierichetti et al [6] implemented an existing greedy Max-k-cover algorithm using MapReduce efficiently and achieved provably approximation to sequential results. Bahmani et al [2] obtained a parallel implementation of K-means++ [1] and empirically showed to have achieved similar results in a constant number of rounds. MapReduce solutions have also been proposed for anonymization [20,22,21], but were limited to achieving k-anonymity for relational data only.…”
Section: Related Work (mentioning)
confidence: 99%
“…Note that map output with the same key must be sent to the same reducer, so the number of reducers needed is determined by min(R, (2). The reduce cost t R of a single MapReduce round is dominated by the cost of setting up the reducers and reading the shuffled data sent by the mappers:…”
Abstract. Privacy is a concern when publishing transaction data for applications such as marketing research and biomedical studies. While methods for anonymizing transaction data exist, they are designed to run on a single machine, hence not scalable to large datasets. Recently, MapReduce has emerged as a highly scalable platform for data-intensive applications. In the paper, we consider how MapReduce may be used to provide scalability in transaction anonymization. More specifically, we consider how RBAT may be parallelized using MapReduce. RBAT is a sequential method that has some desirable features for transaction anonymization, but its highly iterative nature makes its parallelization challenging. A direct implementation of RBAT on MapReduce using data partitioning alone can result in significant overhead, which can offset the gains from parallel processing. We propose MR-RBAT that employs two parameters to control parallelization overhead. Our experimental results show that MR-RBAT can scale linearly to large datasets and can retain good data utility.
“…Using the LDA document-cluster distribution instead of TFIDF leads to a significant reduction in processing time, however the quality of the resulting model suffers and is lower than the pure LDA results. K-Means is thus only viable for smaller collections, although the exact limit depends on what optimisations [25] can be achieved through improved initialisation [4] or parallelisation [32].…”
Section: # Main Cluster Intruder Cluster 1 Renfrews All Views (mentioning)
Abstract. Large digital libraries have become available over the past years through digitisation and aggregation projects. These large collections present a challenge to the new user who wishes to discover what is available in the collections. Subject classification can help in this task, however in large collections it is frequently incomplete or inconsistent. Automatic clustering algorithms provide a solution to this, however the question remains whether they produce clusters that are sufficiently cohesive and distinct for them to be used in supporting discovery and exploration in digital libraries. In this paper we present a novel approach to investigating cluster cohesion that is based on identifying intruders in a cluster. The results from a human-subject experiment show that clustering algorithms produce clusters that are sufficiently cohesive to be used where no (consistent) manual classification exists.
“…Chief among these is that traditional ways of processing data have become inadequate. This is witnessed in how the dozen or so traditional algorithms [2], e.g., k-means or EM, have begun to be rethought and retooled [3][4][5] to deal with data that is now too massive, too complex, produced too quickly, etc., to be effectively analyzed as we have in the past.…”
Existing data mining techniques, more particularly iterative learning algorithms, become overwhelmed with big data. While parallelism is an obvious and, usually, necessary strategy, we observe that both (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms like the expectation maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a nonlinear hierarchical data structure (heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T, EM*. We show that our EM* algorithm outperforms the EM-T algorithm over large real-world and synthetic data sets. We lastly conclude with some theoretical underpinnings that explain why EM* is successful.