Scalable k-means++

Bahmani, Bahman; Moseley, Benjamin; Vattani, Andrea; Kumar, Ravi; Vassilvitskii, Sergei

doi:10.14778/2180912.2180915

Cited by 539 publications

(340 citation statements)

References 33 publications

Supporting

Mentioning

321

Contrasting

Unclassified


“…Chierichetti et al [6] implemented an existing greedy Maxk-cover algorithm using MapReduce efficiently and achieved provably approximation to sequential results. Bahmani et al [2] obtained a parallel implementation of K-means++ [1] and empirically showed to have achieved similar results in a constant number of rounds. MapReduce solutions have also been proposed for anonymization [20,22,21], but were limited to achieving k-anonymity for relational data only.…”

Section: Related Work | mentioning

confidence: 99%


MR-RBAT: Anonymizing Large Transaction Datasets Using MapReduce

Memon

Shao

2015

Data and Applications Security and Privacy XXIX


Abstract. Privacy is a concern when publishing transaction data for applications such as marketing research and biomedical studies. While methods for anonymizing transaction data exist, they are designed to run on a single machine, hence not scalable to large datasets. Recently, MapReduce has emerged as a highly scalable platform for data-intensive applications. In the paper, we consider how MapReduce may be used to provide scalability in transaction anonymization. More specifically, we consider how RBAT may be parallelized using MapReduce. RBAT is a sequential method that has some desirable features for transaction anonymization, but its highly iterative nature makes its parallelization challenging. A direct implementation of RBAT on MapReduce using data partitioning alone can result in significant overhead, which can offset the gains from parallel processing. We propose MR-RBAT that employs two parameters to control parallelization overhead. Our experimental results show that MR-RBAT can scale linearly to large datasets and can retain good data utility.


“…Note that map output with the same key must be sent to the same reducer, so the number of reducers needed is determined by min(R, (2). The reduce cost t_R of a single MapReduce round is dominated by the cost of setting up the reducers and reading the shuffled data sent by the mappers:…”

Section: Algorithm 3 Findmaxpair (ĩ) | mentioning

confidence: 99%

MR-RBAT: Anonymizing Large Transaction Datasets Using MapReduce

Memon

Shao

2015

Data and Applications Security and Privacy XXIX


“…Using the LDA document-cluster distribution instead of TFIDF leads to a significant reduction in processing time, however the quality of the resulting model suffers and is lower than the pure LDA results. K-Means is thus only viable for smaller collections, although the exact limit depends on what optimisations [25] can be achieved through improved initialisation [4] or parallelisation [32].…”

Section: # Main Cluster Intruder Cluster 1 Renfrews All Views | mentioning

confidence: 99%

Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Hall

Clough

Stevenson

2012

Theory and Practice of Digital Libraries


Abstract. Large digital libraries have become available over the past years through digitisation and aggregation projects. These large collections present a challenge to the new user who wishes to discover what is available in the collections. Subject classification can help in this task; however, in large collections it is frequently incomplete or inconsistent. Automatic clustering algorithms provide a solution to this; however, the question remains whether they produce clusters that are sufficiently cohesive and distinct for them to be used in supporting discovery and exploration in digital libraries. In this paper we present a novel approach to investigating cluster cohesion that is based on identifying intruders in a cluster. The results from a human-subject experiment show that clustering algorithms produce clusters that are sufficiently cohesive to be used where no (consistent) manual classification exists.


“…Chief among these is that traditional ways of processing data have become inadequate. This is witnessed in how the dozen or so traditional algorithms [2], e.g., k-means or EM, have begun to be rethought and retooled [3][4][5] to deal with data that is now too massive, too complex, produced too quickly, etc., to be effectively analyzed as we have in the past.…”

Section: Introduction | mentioning

confidence: 99%

Using data to build a better EM: EM* for big data

Kurban

Jenne

Dalkılıç

2017

Int J Data Sci Anal


Abstract. Existing data mining techniques, more particularly iterative learning algorithms, become overwhelmed with big data. While parallelism is an obvious and, usually, necessary strategy, we observe that both (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms like the expectation maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a nonlinear hierarchical data structure (heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T, EM*. We show that our EM* algorithm outperforms the EM-T algorithm over large real-world and synthetic data sets. We conclude with some theoretical underpinnings that explain why EM* is successful.
