2012
DOI: 10.14778/2180912.2180915

Scalable k-means++

Abstract: Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k pa…
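The sequential nature the abstract refers to comes from the seeding loop itself: each of the k centers is sampled with probability proportional to its squared distance from the centers already chosen, so every pass depends on the previous one. A minimal pure-Python sketch of that D²-sampling loop, on 1-D points for readability (function name and structure are my own, not the paper's code):

```python
import random

def kmeans_pp_init(points, k, rng=None):
    """k-means++ seeding on 1-D points: each new center is sampled with
    probability proportional to its squared distance (D^2) from the
    nearest center chosen so far."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        total = sum(d2)
        if total == 0:  # every point coincides with a chosen center
            break
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:  # r landed in this point's probability mass
                centers.append(p)
                break
    return centers
```

Note how the k − 1 outer iterations cannot be parallelized: the weights d2 change after every chosen center, which is exactly the bottleneck the paper's k-means|| variant attacks.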

Cited by 539 publications (340 citation statements)
References 33 publications
“…Chierichetti et al. [6] implemented an existing greedy Max-k-cover algorithm efficiently in MapReduce and achieved a provable approximation of the sequential results. Bahmani et al. [2] obtained a parallel implementation of k-means++ [1] and empirically showed that it achieves similar results in a constant number of rounds. MapReduce solutions have also been proposed for anonymization [20,22,21], but were limited to achieving k-anonymity for relational data only.…”
Section: Related Work
Mentioning confidence: 99%
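The parallel k-means++ implementation of Bahmani et al. that this statement cites works by oversampling: instead of picking one center per pass, each round keeps every point independently with probability proportional to its contribution to the current clustering cost, so the candidate set grows in a few parallel-friendly batches and is only later reduced to k centers. A hedged 1-D sketch of that oversampling phase (names and the simplified probability are my own; the reclustering step is omitted):

```python
import random

def kmeans_parallel_oversample(points, l, rounds, rng=None):
    """Oversampling phase of k-means||: each round keeps every point
    independently with probability ~ l * d^2(point, C) / cost(C), so the
    candidate set C grows in batches instead of one center at a time."""
    rng = rng or random.Random(1)
    C = [rng.choice(points)]
    for _ in range(rounds):
        d2 = [min((p - c) ** 2 for c in C) for p in points]
        phi = sum(d2)  # current clustering cost
        if phi == 0:
            break
        C += [p for p, w in zip(points, d2)
              if rng.random() < min(1.0, l * w / phi)]
    return C
```

Each round's per-point sampling decisions are independent given C, which is why the constant number of rounds observed empirically translates into a constant number of MapReduce passes.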
“…Note that map output with the same key must be sent to the same reducer, so the number of reducers needed is determined by min(R, (2). The reduce cost t_R of a single MapReduce round is dominated by the cost of setting up the reducers and reading the shuffled data sent by the mappers:…”
Section: Algorithm 3 FindMaxPair(ĩ)
Mentioning confidence: 99%
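The constraint the statement starts from (all map output sharing a key must reach one reducer) is enforced in MapReduce by the shuffle's key partitioning. A minimal sketch of that routing rule, assuming the default hash partitioner (function name is my own):

```python
from collections import defaultdict

def shuffle_to_reducers(map_output, num_reducers):
    """Route every (key, value) pair to reducer hash(key) % num_reducers,
    which guarantees all pairs with the same key reach the same reducer."""
    buckets = defaultdict(list)
    for key, value in map_output:
        buckets[hash(key) % num_reducers].append((key, value))
    return dict(buckets)
```

Because the partition is a pure function of the key, the number of reducers that actually receive data is capped by the number of distinct keys, which is why the quoted cost model takes a min with R.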
“…Using the LDA document-cluster distribution instead of TF-IDF leads to a significant reduction in processing time; however, the quality of the resulting model suffers and is lower than the pure LDA results. K-Means is thus only viable for smaller collections, although the exact limit depends on what optimisations [25] can be achieved through improved initialisation [4] or parallelisation [32].…”
Section: # Main Cluster Intruder Cluster 1 Renfrews All Views
Mentioning confidence: 99%
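The TF-IDF representation this statement contrasts with LDA is the standard sparse weighting tf × log(N/df): terms that appear in every document get weight zero, rare terms get boosted. A minimal stdlib sketch of that weighting over tokenized documents (not the cited authors' pipeline):

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain tf-idf: weight(term, doc) = tf * log(N / df), where df is the
    number of documents containing the term. Returns one sparse dict per
    document -- the high-dimensional vectors K-Means would cluster."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]
```

The dimensionality of these vectors grows with the vocabulary, which is one reason clustering them is slower than clustering the fixed-size LDA topic distributions the passage mentions.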
“…Chief among these is that traditional ways of processing data have become inadequate. This is witnessed in how the dozen or so traditional algorithms [2], e.g., k-means or EM, have begun to be rethought and retooled [3][4][5] to deal with data that is now too massive, too complex, produced too quickly, etc., to be effectively analyzed as we have in the past.…”
Section: Introduction
Mentioning confidence: 99%