A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

Sinha, Ankita; Jana, Prasanta K.

doi:10.1007/s11227-017-2182-8

Cited by 36 publications

(25 citation statements)

References 23 publications

“…On the other hand, [11] used a Genetic Algorithm (GA) with Mahalanobis distance along with the K-means clustering algorithm as an influential combination to propose a two-phase clustering algorithm for distributed datasets. In first phase, GA is utilized in parallel on fragments, which were assigned to different sites.…”

Section: Related Work (mentioning)

confidence: 99%

“…While Equation (11) computed the attribute access matrix of sites (AAMS), AAMS was used to yield the total access cost matrix for all sites (TACS) with the help of Equation (11). In Equations (12) and (13), the final allocation of fragments over the cluster of sites was decided when second and third scenarios of allocation were being addressed.…”

Section: Cost Functions (mentioning)

confidence: 99%

“…The first step was to produce the attribute access matrix of sites (AAMS) using QFM and AIM along with Equation (11). Every AAMS ij gave the net access cost of each site S j , to reach Attribute A i , Table 12.…”

Section: Allocation Process (mentioning)

confidence: 99%

Towards an Efficient Data Fragmentation, Allocation, and Clustering Approach in a Distributed Environment

Abdalla

Artoli

2019

Information

Data fragmentation and allocation have long proven to be an efficient technique for improving the performance of distributed database systems (DDBSs). A crucial feature of any successful DDBS design is an emphasis on minimizing transmission costs (TC). This work therefore focuses on improving distribution performance through transmission cost minimization. To do so, data fragmentation and allocation techniques are used, and several data replication scenarios are investigated. Moreover, site clustering is leveraged with the aim of producing the minimum possible number of highly balanced clusters. By doing so, TC is shown to be greatly reduced, as depicted in the performance evaluation. DDBS performance is measured using a TC objective function. An inclusive evaluation was made in a simulated environment, and the compared results demonstrate the superiority and efficacy of the proposed approach in reducing TC.

“…Yuan [23] proposed an improved K-means parallel algorithm has also achieved good results. Ankita [24] combines genetic algorithm with k-means algorithm and proposed a novel clustering algorithm for distributed datasets. The above work proves that the algorithm based on MapReduce can well avoid the limitation of data size, and makes the mining of hyper-scale product review data possible.…”

Section: Related Work (mentioning)

confidence: 99%

“…In order to further verify the feasibility of the PR-HD algorithm, we choose a total of three MapReduce-based algorithms from reference [22], [23] and [24] as the comparison algorithm. Reference [22] improved VSM model, and designed a parallel fuzzy c-means algorithm for hot microblogging topics discovery(HTD-PFCM).…”

Section: Accuracy Analysis (mentioning)

confidence: 99%

Research on Product Reviews Hot Spot Discovery Algorithm Based on Mapreduce

Liu

2020

IEEE Access

In recent years, with the development of e-commerce, the scale of comment data has grown exponentially. In this paper, PR-HD, a product review hot spot discovery algorithm based on MapReduce, is proposed. The algorithm uses the Vector Space Model to vectorize the text data of the reviews and the TF-IDF algorithm to calculate the position weight of the feature words, then combines the Canopy algorithm and the K-Means algorithm to discover hot spots in product reviews. At the same time, the algorithm gains the ability to process massive data through the MapReduce framework. Experiments demonstrate that the PR-HD algorithm has high accuracy and parallel efficiency. This allows product developers to obtain more direct and effective suggestions and feedback.

Genetic Algorithm Based Parallel K-Means Data Clustering Algorithm Using MapReduce Programming Paradigm on Hadoop Environment (GAPKCA)

Alshammari

Zolkepli

Abdullah

2019

Advances in Intelligent Systems and Computing
