Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster

Singh, Sudhakar; Garg, Rakhi; Mishra, Pragnyaban

doi:10.5120/ijca2015906632

Cited by 15 publications

(16 citation statements)

References 17 publications


“…Distributed computing using Hadoop and Spark frameworks became popular because of their parallel processing [42]. Several research works adopted the Hadoop with MapReduce programming engine for frequent itemset mining on big data [43‐45]. Finally, on a practical level, the deployment of our solution in a real setting within the CRM service, would allow us to see the contribution of “optimized” ARs compared to other rules and to show the advantage of using our framework in the decision‐making process.…”

Section: Discussion (mentioning)

confidence: 99%

Decision support based on optimized data mining techniques: Application to mobile telecommunication companies

Berkani

2020

Concurrency and Computation


Summary: One of the most important challenges for African countries is the effective use of information and communication technologies and their generalization across different sectors (education, economic, political, and so on). This will have a great impact on different aspects of society and economic activities by making everyday procedures easier and more efficient. In this context, this article proposes a decision support framework based on data mining (DM) techniques. Using DM in e‐government is the process of translating data into appropriate knowledge that can be useful for decision‐making. We propose a framework whose goal is the generation of association rules (ARs) for better decision‐making. This framework includes two approaches: (1) the first approach applies different DM algorithms and (2) the second approach optimizes the first one by considering two different metaheuristics: the Genetic algorithm and the Cuckoo search algorithm. A new relevance measure called “Weighted Dominance” has been considered to evaluate the quality of the generated ARs. Extensive experiments have been conducted using different datasets. The results obtained demonstrated the effectiveness of combining DM and optimization algorithms compared to the first approach. Finally, a case study related to an Algerian mobile phone company has been presented illustrating the use of our framework in the decision‐making process.


“…The nodes of the Trie are simple which may be cached in, and the linear search is faster in cache memory. Singh et al [19] have investigated the performance of the same three data structures in the context of MapReduce-based Apriori on the Hadoop cluster. The authors have shown in their experimental results that Hash Table Trie performed much better than Trie on some datasets while Hash Tree was the worst one.…”

Section: Performance Analysis (mentioning)

confidence: 99%

“…The data structure perspective to the performance of Spark-based Apriori has not been explored well. Singh et al [19] have evaluated the performance of the Apriori algorithm on the different data structures, but on the Hadoop MapReduce and not on the Spark.…”

Section: Introduction (mentioning)

confidence: 99%

A data structure perspective to the RDD-based Apriori algorithm on Spark

Singh

Mishra

et al. 2019

Int. j. inf. tecnol.

Self Cite


In recent years, many researchers have proposed efficient and scalable frequent itemset mining algorithms for big data analytics. Initially, MapReduce-based frequent itemset mining algorithms on Hadoop cluster were proposed. Although Hadoop has been developed as a cluster computing system for handling and processing big data, its performance does not meet expectations for the iterative algorithms of data mining, due to its high I/O and the writing and then reading of intermediate results on disk. Consequently, Spark has been developed as another cluster computing infrastructure which is much faster than Hadoop due to its in-memory computation. It is highly suitable for iterative algorithms and supports batch, interactive, iterative, and stream processing of data. Many frequent itemset mining algorithms have been re-designed on Spark, and most of them are Apriori-based. All these Spark-based Apriori algorithms use Hash Tree as the underlying data structure. This paper investigates the efficiency of various data structures for the Spark-based Apriori. The data structure perspective has been investigated previously, but only for MapReduce-based Apriori, so it must be re-investigated in the distributed computing environment of Spark. The considered underlying data structures are Hash Tree, Trie, and Hash Table Trie. The experimental results on the benchmark datasets show that the performance of Spark-based Apriori with Trie and with Hash Table Trie is almost similar, but both perform many times better than Hash Tree in the distributed computing environment of Spark.
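To make the data structure comparison above concrete, here is a minimal Python sketch of how a trie with hash-table children can store Apriori candidates and count their support. It is an illustration only, not the implementation evaluated in either paper: the names `TrieNode`, `build_trie`, `count_supports`, and `frequent_itemsets` are hypothetical, and using a `dict` for the child lookup is an assumption standing in for the "Hash Table Trie" idea (a plain Trie would instead search a sorted child list linearly).

```python
class TrieNode:
    """One node per item; a path of length k from the root is a candidate k-itemset."""

    def __init__(self):
        self.children = {}  # item -> TrieNode (assumed hash-table lookup per level)
        self.count = 0      # support count, meaningful only for nodes at depth k


def build_trie(candidates):
    """Insert every candidate itemset into the trie, items in sorted order."""
    root = TrieNode()
    for itemset in candidates:
        node = root
        for item in sorted(itemset):
            node = node.children.setdefault(item, TrieNode())
    return root


def count_supports(root, transaction, k):
    """Increment the count of every candidate k-itemset contained in the transaction."""
    items = sorted(set(transaction))

    def walk(node, start, depth):
        if depth == k:
            node.count += 1
            return
        # Stop early enough that k - depth items still remain to be matched.
        for i in range(start, len(items) - (k - depth) + 1):
            child = node.children.get(items[i])
            if child is not None:
                walk(child, i + 1, depth + 1)

    walk(root, 0, 0)


def frequent_itemsets(root, k, min_support):
    """Collect the candidates whose support count reaches min_support."""
    result = {}

    def collect(node, prefix):
        if len(prefix) == k:
            if node.count >= min_support:
                result[tuple(prefix)] = node.count
            return
        for item, child in node.children.items():
            collect(child, prefix + [item])

    collect(root, [])
    return result


if __name__ == "__main__":
    # Toy data, invented for this example.
    transactions = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"]]
    candidates = [("a", "b"), ("a", "c"), ("b", "c")]
    trie = build_trie(candidates)
    for t in transactions:
        count_supports(trie, t, k=2)
    print(frequent_itemsets(trie, k=2, min_support=2))
```

In a MapReduce or Spark setting, each worker would typically run the counting step over its own partition of transactions and the per-candidate counts would then be summed; that distribution layer is left out of this sketch.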


“…Some researchers use various data structures to improve the efficiency of association rule mining algorithms. Singh [35] tries to use a hash table, hash trie and hash table trie for candidate storage in Apriori MapReduce-based implementation. They find that hash table trie is most efficient than others in MapReduce context while it is not much efficient in a sequential approach.…”

Section: Related Work (mentioning)

confidence: 99%

Adaptive-Miner: an efficient distributed association rule mining algorithm on Spark

Rathee

Kashyap

2018

J Big Data


Extraction of valuable data from extensive datasets is one of the most vital research issues. Association rule mining is one of the most widely used methods for this purpose. Finding possible associations between items in large transaction-based datasets (finding frequent itemsets) is the most crucial part of the association rule mining task. Many single-machine association rule mining algorithms exist, but the massive amount of data available these days is beyond the capacity of a single-machine algorithm. Therefore, to meet the demands of this ever-growing data, there is a need for a distributed association rule mining algorithm which can run on multiple machines. For these types of parallel/distributed applications, MapReduce is one of the best fault-tolerant frameworks. Hadoop is one of the most popular open-source software frameworks with a MapReduce-based approach for distributed storage and processing of large datasets using standalone clusters built from commodity hardware. But heavy disk I/O at each iteration of a highly iterative algorithm like Apriori makes Hadoop inefficient. A number of MapReduce-based platforms have been developed for parallel computing in recent years. Among them, Spark has attracted a lot of attention because of its built-in support for distributed computations. Therefore, we implemented a distributed association rule mining algorithm on Spark, named Adaptive-Miner, which uses an adaptive approach for finding frequent patterns with higher accuracy and efficiency. Adaptive-Miner uses an adaptive strategy based on the partial processing of datasets. Adaptive-Miner makes execution plans before every iteration and goes with the most suitable plan to minimize time and space complexity. Adaptive-Miner is a dynamic association rule mining algorithm which changes its approach based on the nature of the dataset. Therefore, it is different from and better than state-of-the-art static association rule mining algorithms. We conduct in-depth experiments to gain insight into the effectiveness, efficiency, and scalability of the Adaptive-Miner algorithm on Spark.

