A New Data Mining Algorithm based on MapReduce and Hadoop

Yang, Xianfeng; Lian, Liming; Henan, Xinxiang

doi:10.14257/ijsip.2014.7.2.13

Cited by 11 publications

(5 citation statements)

References 11 publications


“…Therefore, people have proposed various approximations to PAM, such as CLARA and CLARANS discussed before. Yang and Lian (2014) parallelize the "k-means like" variant with map-reduce, parallelizing over the cluster in the reduce step. When cluster sizes vary substantially, this needs $O(n^2)$ memory in the reducer, and may yield next to no speedup in the worst case.…”

Section: Variants of PAM (mentioning)

confidence: 99%

“…Nevertheless, a few seminal methods such as hierarchical clustering, k-means, PAM Rousseeuw, 1987, 1990c), and DBSCAN (Ester et al, 1996) have received repeated and widespread use. One may be tempted to think that these classic methods have all been well researched and understood, but there are still many scientific publications trying to explain these algorithms better (e.g., Schubert et al, 2017), trying to parallelize and scale them to larger data sets (e.g., Lijffijt et al, 2015; Yang and Lian, 2014), trying to better understand similarities and relationships among the published methods (e.g., , or proposing further improvements - and so does this paper for the widely used PAM algorithm, also often referred to as k-medoids clustering.…”

Section: Introduction (mentioning)

confidence: 99%


Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

Schubert

Rousseeuw

2019

Lecture Notes in Computer Science


Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In Euclidean geometry the mean, as used in k-means, is a good estimator for the cluster center, but this does not exist for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains such as biology that require the use of Jaccard, Gower, or more complex distances. A key issue with PAM is, however, its high run time cost. In this paper, we propose modifications to the PAM algorithm where at the cost of storing O(k) additional values, we can achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm, but will still find the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. With the substantially faster SWAP, we can now also explore alternative (faster) strategies for choosing the initial medoids.
We also show how the CLARA and CLARANS algorithms benefit from the proposed modifications. While we do not further study the parallelization of our approach in this work, it can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine, hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important. In experiments on real data with k = 100, we observed a 200× speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets as long as we can afford to compute a distance matrix, and in particular to higher k (at k = 2, the new SWAP was only 1.5 times faster, as the speedup is expected to increase with k).
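The medoid definition in the abstract above (the object with the smallest dissimilarity to all others in the cluster) can be illustrated with a minimal Python sketch. The function name `medoid` and the 4-object dissimilarity matrix are hypothetical, chosen only for illustration; this is not the authors' implementation and it does not include the SWAP optimizations the abstract describes.

```python
def medoid(dist, members):
    """Return the member index with the smallest total dissimilarity
    to all members of the cluster (ties go to the first such member)."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


# Hypothetical precomputed dissimilarity matrix (symmetric, zero diagonal).
# Any dissimilarity works here (Jaccard, Gower, ...); no mean is required.
dist = [
    [0, 1, 3, 9],
    [1, 0, 1, 8],
    [3, 1, 0, 7],
    [9, 8, 7, 0],
]

center = medoid(dist, [0, 1, 2, 3])
```

Because only pairwise dissimilarities are consulted, the same function applies unchanged to non-Euclidean data, which is the property the abstract emphasizes.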


“…To analyze a lot of data with enough resources, we need to make clustering methods take less time and use less memory. The authors in [15] adapted MapReduce to medoids. During mapping, it places each object next to its closest medoid, and during reduction, it moves the real medoid to the center of the group.…”

Section: Related Work (mentioning)

confidence: 99%
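The map and reduce steps described in the statement above can be sketched in plain Python as a single-process simulation. This is an assumption-laden illustration, not the code from the cited paper: the function names (`map_phase`, `shuffle`, `reduce_phase`, `kmedoids`), the 1-D sample points, and the absolute-difference dissimilarity `d` are all hypothetical. Note that each reducer compares every pair of objects in its cluster, so one very large cluster leads to quadratic work in a single reducer.

```python
from collections import defaultdict


def map_phase(points, medoids, d):
    """Map: emit (closest medoid, point) for every point."""
    for p in points:
        yield min(medoids, key=lambda m: d(p, m)), p


def shuffle(pairs):
    """Group mapped values by key, as a MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(cluster, d):
    """Reduce: pick the member with the smallest total dissimilarity.
    This evaluates all pairs in the cluster (quadratic in cluster size)."""
    return min(cluster, key=lambda c: sum(d(c, x) for x in cluster))


def kmedoids_iteration(points, medoids, d):
    """One map/shuffle/reduce round; returns the updated medoids, sorted."""
    groups = shuffle(map_phase(points, medoids, d))
    return sorted(reduce_phase(members, d) for members in groups.values())


def kmedoids(points, medoids, d, max_iter=20):
    """Repeat rounds until the medoids stop changing or max_iter is hit."""
    medoids = sorted(medoids)
    for _ in range(max_iter):
        updated = kmedoids_iteration(points, medoids, d)
        if updated == medoids:
            break
        medoids = updated
    return medoids


# Hypothetical sample data: two well-separated groups on a line.
points = [1, 2, 3, 10, 11, 12]
d = lambda a, b: abs(a - b)
final_medoids = kmedoids(points, [1, 2], d)
```

In a real Hadoop job the shuffle is performed by the framework and each call to `reduce_phase` runs on a separate reducer; the sketch keeps everything in one process so the control flow is easy to follow.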

Big Data Clustering Using Chemical Reaction Optimization Technique: A Computational Symmetry Paradigm for Location-Aware Decision Support in Geospatial Query Processing

et al. 2022


The emergence of geospatial big data has opened up new avenues for identifying urban environments. Although both geographic information systems (GIS) and expert systems (ES) have been useful in resolving geographical decision issues, they are not without their own shortcomings. The combination of GIS and ES has gained popularity due to the necessity of boosting the effectiveness of these tools in resolving very difficult spatial decision-making problems. The clustering method generates the functional effects necessary to apply spatial analysis techniques. In a symmetric clustering system, two or more nodes run applications and monitor each other simultaneously. This system is more efficient than an asymmetric system since it utilizes all available hardware and does not maintain a node in a hot standby state. However, it is still a major issue to figure out how to expand and speed up clustering algorithms without sacrificing efficiency. The work presented in this paper introduces an optimized hierarchical distributed k-medoid symmetric clustering algorithm for big data spatial query processing. To increase the k-medoid method’s efficiency and create more precise clusters, a hybrid approach combining the k-medoid and Chemical Reaction Optimization (CRO) techniques is presented. CRO is used in this approach to broaden the scope of the optimal medoid and improve clustering by obtaining more accurate data. The suggested paradigm solves the current technique’s issue of predicting the accurate clusters’ number. The suggested approach includes two phases: in the first phase, the local clusters are built using Apache Spark’s parallelism paradigm based on their portion of the whole dataset. In the second phase, the local clusters are merged to create condensed and reliable final clusters. The suggested approach condenses the data provided during aggregation and creates the ideal clusters’ number automatically based on the dataset’s structures. 
The suggested approach is robust and delivers high-quality results for spatial query analysis, as shown by experimental results. The proposed model reduces average query latency by 23%.


“…It is already proved that the inverse matrix of $\{I - (1+d)^{-1} T\}^{-1}$ exists [30]. $CLV_i$ is an $(mn+2) \times 1$ column matrix whose $j$th element denotes the cumulative profitability generated by customer $i$ while he or she remains at the state.…”

Section: Figure 3 One-step Transition Matrix (mentioning)

confidence: 99%

“…Data mining is to discover hidden useful information in large databases. Mining frequent patterns from transaction databases is an important problem in data mining [30]. Figure 4 shows the data mining procedure to predict various transition probabilities in the case study.…”

Section: Decision Variables Of Individual CLV (mentioning)

confidence: 99%