Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark

Castro, Eduardo P. S.; Maia, Thiago D.; Pereira, Marluce Rodrigues; Esmin, Ahmed Ali Abdalla; Pereira, Denilson Alves

doi:10.1017/s0269888918000127

Cited by 12 publications

(6 citation statements)

References 18 publications


“…The proposed algorithm reduces the number of transactions and the time spent processing the dataset. Castro et al [33] compared alternative Apriori [35], and it succeeded in reducing the communication cost during the shuffle process by allocating the tasks across cluster nodes in a fair and effective way using search space division strategy. It showed better performance in running time, memory usage, and scalability.…”

Section: Related Work (mentioning)

confidence: 99%

ASCF: Optimization of the Apriori Algorithm Using Spark-Based Cuckoo Filter Structure

Alrahwan, Farouk

2024

International Journal of Intelligent Systems


Data mining is the process used for extracting hidden patterns from large databases using a variety of techniques. For example, in supermarkets, we can discover the items that are often purchased together and that are hidden within the data. This helps make better decisions which improve the business outcomes. One of the techniques that are used to discover frequent patterns in large databases is frequent itemset mining (FIM) that is a part of association rule mining (ARM). There are different algorithms for mining frequent itemsets. One of the most common algorithms for this purpose is the Apriori algorithm that deduces association rules between different objects which describe how these objects are related together. It can be used in different application areas like market basket analysis, student’s courses selection process in the E-learning platforms, stock management, and medical applications. Nowadays, there is a great explosion of data that will increase the computational time in the Apriori algorithm. Therefore, there is a necessity to run the data-intensive algorithms in a parallel-distributed environment to achieve a convenient performance. In this paper, optimization of the Apriori algorithm using the Spark-based cuckoo filter structure (ASCF) is introduced. ASCF succeeds in removing the candidate generation step from the Apriori algorithm to reduce computational complexity and avoid costly comparisons. It uses the cuckoo filter structure to prune the transactions by reducing the number of items in each transaction. The proposed algorithm is implemented on the Spark in-memory processing distributed environment to reduce processing time. ASCF offers a great improvement in performance over the other candidate algorithms based on Apriori, where it achieves a time of only 5.8% of the state-of-the-art approach on the retail dataset with a minimum support of 0.75%.
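For orientation, the frequent itemset mining step that the abstract above describes can be sketched as a classic single-machine Apriori. This is only an illustrative sketch, not ASCF (which, per the abstract, removes candidate generation) and not any of the Hadoop or Spark implementations. The function name `apriori`, the toy `baskets` data, and the 0.5 support threshold are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support
    (fraction of transactions containing it) is >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: unions of frequent (k-1)-itemsets with exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: keep only candidates that meet the threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market-basket data, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "butter"],
]
result = apriori(baskets, min_support=0.5)
```

With this toy data, each single item and each pair meets the 0.5 threshold, while the three-item set appears in only one of four baskets and is dropped. The repeated passes over `baskets` in the count step are the cost that the distributed variants discussed on this page try to reduce.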



“…In another work, the authors used the Apriori algorithm in three different execution approaches IMRAprioriAcc (Improved MapReduce Apriori Accelerated), DPC (Dynamic Passes Combined-Counting) and CPA (Complete Parallel Apriori) along with their adaption on Spark with different size datasets and varying cluster configuration [21]. Four performance metrics runtime, speed-up, size-up and scale-up are used for the performance evaluation of the Hadoop MapReduce and Spark.…”

Section: 1 (mentioning)

confidence: 99%

Performance Comparison of Apache Spark and Hadoop for Machine Learning based iterative GBTR on HIGGS and Covid-19 Datasets

Sewal, Hari Singh

2024

SCPE


In the realm of distributed computing frameworks, such as Apache Spark and MapReduce Hadoop, the efficacy of these frameworks varies across diverse applications and algorithms contingent upon distinctive evaluation metrics and critical parameters. This research paper diligently scrutinizes the extant body of research that compares these two frameworks concerning said evaluation metrics and parameters. Subsequently, it conducts empirical investigations to authenticate the performance of these frameworks in the context of an iterative Gradient Boosting Tree Regression (GBTR) algorithm. Remarkably, the comparative analyses in previous studies encompass a spectrum of iterative machine learning regression and classification techniques, batch processing, SQL, and Graph processing algorithms. Furthermore, numerous investigations have explored the application of machine learning algorithms encompassing logistic regression, Page Rank, K-Means, KNN, and the HiBench suite. This paper presents the comparison between the two distributed computing platforms on iterative GBTR for classification task on the HIGGS dataset from the physics domain and for the regression task on the Covid-19 dataset from the healthcare domain. The empirical findings corroborate that Apache Spark exhibits superior execution speed in iterative tasks when the available physical memory significantly exceeds the dataset size. Conversely, Hadoop outperforms Spark when dealing with substantial datasets or constrained physical memory resources.


“…Table 1 summing up the platforms comparison of the big data. Each platform has its advantages over the other, therefore, selecting the best platform depends on the big data characteristics and requirements [18] [13] [61] [25].…”

Section: Big Data Clustering Platforms (mentioning)

confidence: 99%

Big Data Clustering Techniques Challenged and Perspectives: Review

Awad, Hamad

2023

IJCAI


Clustering in big data is considered a critical data mining and analysis technique. There are issues with adapting clustering algorithms to large amounts of data and new challenges brought by big data. As the size of big data is up to petabytes of data, and clustering methods have high processing costs, the challenge is how to handle this issue and utilize clustering techniques for big data efficiently. This study aims to investigate the recent advancement of clustering platforms and techniques to handle big data issues, from the early suggested techniques to today's novel solutions. The methodology and specific issues for building an effective clustering mechanism are presented and evaluated, followed by a discussion of the choices for enhancing clustering algorithms. A brief literature review of the recent advancement in clustering techniques has been presented to address each solution's main characteristics and drawbacks. Summary (translated from Slovenian): The article presents an overview of clustering techniques for big data.

