HFIM: a Spark-based hybrid frequent itemset mining algorithm for big data processing

Sethi, Krishan Kumar; Ramesh, Dharavath

doi:10.1007/s11227-017-1963-4

Cited by 55 publications

(19 citation statements)

References 21 publications

“…The analysis results from each reducer were further aggregated to generate association rules for the entire dataset. In (13) the authors used Hybrid Frequent Itemset Mining (HFIM) technique in Spark to optimize the execution time. HFIM uses vertical and horizontal layout of the dataset to find the association.…”

Section: Background Study (mentioning)

confidence: 99%

An efficient FP-Growth based association rule mining algorithm using Hadoop MapReduce

Senthilkumar, Prasad

2020

IJST

Objectives: To improve the performance of an FP-Growth based association rule mining algorithm on massive data through effective use of storage and execution capability and an improved partitioning technique within the Hadoop MapReduce framework. Methodology: The proposed methodology has four main phases. In the first phase, the item sets used for finding frequent patterns are encoded, which minimizes expensive operations on large data sets. In the second phase, improved hash partitioning reduces network overhead and improves communication speed within the MapReduce phase for each item set. In the third phase, compression yields effective use of network bandwidth and storage. In the final phase, a combiner for frequent item set mining reduces the overhead of the reduce phase by finding the patterns in each partition, which lowers the overall execution time of the FP-Growth algorithm. Findings: The FP-Growth based association rule mining algorithm is designed for parallel execution on a distributed cluster of servers. The changes to the MapReduce implementation of FP-Growth (encoding, improved hash partitioning, compression and configuration) result in a significant reduction in execution time. Novelty/Improvements: According to the experimental results, the changes at the storage and processing levels within the MapReduce framework improve the overall performance of parallel frequent item set mining on a Hadoop cluster.


“…DFIMA (Distributed Frequent Itemset Mining Algorithm) [16] is a Spark-based Apriori algorithm that uses a Boolean vector for the frequent items and a matrix-based pruning method to reduce the size of candidates. HFIM (Hybrid Frequent Itemset Mining) [17] is also an Apriori-based algorithm along with vertical format of the dataset that reduces the scanning of the dataset. It uses both horizontal and vertical dataset obtained by eliminating infrequent items, where horizontal dataset is distributed across the worker nodes and vertical dataset is shared.…”

Section: Related Work (mentioning)

confidence: 99%
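
The statement above describes HFIM as Apriori-based mining that removes infrequent items and then relies on a vertical dataset to avoid rescanning the transactions. A minimal single-machine sketch of that idea follows. It is not the authors' Spark implementation: the function names `build_vertical` and `mine_frequent_itemsets` are hypothetical, `min_support` is assumed to be an absolute count, and the distribution of the horizontal dataset across worker nodes is omitted entirely.

```python
from itertools import combinations


def build_vertical(transactions, min_support):
    """Build a vertical layout (item -> set of transaction ids), keeping frequent items only."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            vertical.setdefault(item, set()).add(tid)
    # Eliminate infrequent items, as the citation statement describes.
    return {item: tids for item, tids in vertical.items() if len(tids) >= min_support}


def mine_frequent_itemsets(transactions, min_support):
    """Apriori-style level-wise search; support is counted by intersecting tidsets."""
    vertical = build_vertical(transactions, min_support)
    frequent = {frozenset([item]): len(tids) for item, tids in vertical.items()}
    current = list(frequent)
    k = 2
    while current:
        previous = set(current)
        # Join step: merge pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = []
        for cand in candidates:
            # Prune step: every (k-1)-subset of a candidate must itself be frequent.
            if any(cand - {item} not in previous for item in cand):
                continue
            # Support comes from the vertical layout, so the transactions are not rescanned.
            tids = set.intersection(*(vertical[item] for item in cand))
            if len(tids) >= min_support:
                frequent[cand] = len(tids)
                current.append(cand)
        k += 1
    return frequent
```

For example, `mine_frequent_itemsets([["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]], 3)` scans the transactions once to build the tidsets; every later level works only on set intersections.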

“…So, a number of data mining and machine learning algorithms have been re-designed on the Spark RDD framework. FIM algorithms on the Spark have been also proposed by many authors [12][13][14][15][16][17][18], where most of the efforts have been made on the efficient implementations of Apriori-based FIM algorithm on the Spark. The efficiency of the Spark-based Apriori algorithms extensively depend on the way it is parallelized on the Spark, and the underlying data structures used to store and compute frequent itemsets.…”

Section: Introduction (mentioning)

confidence: 99%

A data structure perspective to the RDD-based Apriori algorithm on Spark

Singh, Mishra, et al.

2019

Int. j. inf. tecnol.

In recent years, researchers have proposed a number of efficient and scalable frequent itemset mining algorithms for big data analytics. Initially, MapReduce-based frequent itemset mining algorithms on Hadoop clusters were proposed. Although Hadoop was developed as a cluster computing system for handling and processing big data, its performance does not meet expectations for the iterative algorithms of data mining, because of its high I/O cost: intermediate results are written to disk and then read back. Consequently, Spark was developed as another cluster computing infrastructure, which is much faster than Hadoop owing to its in-memory computation. It is highly suitable for iterative algorithms and supports batch, interactive, iterative, and stream processing of data. Many frequent itemset mining algorithms have been re-designed on Spark, and most of them are Apriori-based. All these Spark-based Apriori algorithms use Hash Tree as the underlying data structure. This paper investigates the efficiency of various data structures for Spark-based Apriori. The data structure perspective has been investigated previously, but only for MapReduce-based Apriori, so it must be re-investigated in the distributed computing environment of Spark. The underlying data structures considered are Hash Tree, Trie, and Hash Table Trie. Experimental results on benchmark datasets show that Spark-based Apriori performs almost the same with Trie and Hash Table Trie, and that both perform many times better than Hash Tree in the distributed computing environment of Spark.
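
To make the data structure question in this abstract concrete, the following is a small sketch of how a trie can hold Apriori candidates and count their support against transactions. It does not reproduce any implementation evaluated in the paper: the class names `TrieNode` and `CandidateTrie` are hypothetical, the code runs on a single machine without Spark, and child links are kept in Python dicts purely for brevity.

```python
class TrieNode:
    """One trie node; a root-to-node path spells out a sorted itemset."""

    def __init__(self):
        self.children = {}         # item -> TrieNode
        self.count = 0             # support counted so far
        self.is_candidate = False  # True if the path to this node is a candidate


class CandidateTrie:
    def __init__(self, candidates):
        self.root = TrieNode()
        for cand in candidates:
            node = self.root
            for item in sorted(cand):
                node = node.children.setdefault(item, TrieNode())
            node.is_candidate = True

    def count_transaction(self, transaction):
        """Increment the count of every candidate contained in the transaction."""
        self._walk(self.root, sorted(set(transaction)), 0)

    def _walk(self, node, items, start):
        if node.is_candidate:
            node.count += 1
        # Only follow items that come later in sorted order, so each node is visited once.
        for i in range(start, len(items)):
            child = node.children.get(items[i])
            if child is not None:
                self._walk(child, items, i + 1)

    def supports(self):
        """Return {candidate as sorted tuple: support count}."""
        result = {}
        stack = [(self.root, ())]
        while stack:
            node, prefix = stack.pop()
            if node.is_candidate:
                result[prefix] = node.count
            for item, child in node.children.items():
                stack.append((child, prefix + (item,)))
        return result
```

A typical use is to build the trie from the current level's candidates, call `count_transaction` once per transaction, and then read the counts back with `supports()`.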

“…HFIM algorithm [29] is another Spark-based implementation of the Apriori algorithm for various data sets, which uses the vertical layout of the data set to solve the problem of scanning the dataset in each iteration. It is implemented on the Spark framework, integrating the concept of resilient distributed datasets and in-memory processing to optimize the processing time of the operation.…”

Section: Related Work (mentioning)

confidence: 99%

“…In this section, the DisPrePost algorithm has been compared to two advanced algorithms, HPrePostPlus [26] and the well-known HFIM [29]. DisPrePost is the first implementation of the PrePost algorithm in the Spark framework, HPrePostPlus is a recent implementation of the Hadoop-based PrePost parallel algorithm [26] with good results, and HFIM is a typical implementation of the Spark-based Apriori parallel algorithm [29] with good performance. We evaluated speed performance by analyzing runtime and scalability.…”

Section: Performance Evaluation (mentioning)

confidence: 99%

An Efficient Distributed Frequent Itemset Mining Algorithm Based on Spark for Big Data

Rochd, Hafidi

2019

IJIES

Frequent item exploration is a fundamental element in many data mining problems aimed at finding interesting models in the data. Recently, the PrePost algorithm was introduced: a new algorithm for extracting frequent element sets based on the idea of N-lists, which in most cases surpasses other current state-of-the-art algorithms. The PrePost algorithm's performance deteriorates when handling big data, and the existing PrePost implementations built on the MapReduce model are not sufficiently powerful for iterative computation. To reduce I/O overhead and take advantage of cluster memory, this article offers an enhanced version of PrePost, the Distributed PrePost (DisPrePost), a parallel algorithm built on the Spark framework. It incorporates the concept of resilient distributed datasets, performs in-memory processing to optimize execution time, and uses a HashMap to further refine the N-list creation process. Experiments show that the DisPrePost algorithm is more efficient and scalable than two advanced state-of-the-art methods, HPrePostPlus and the well-known HFIM algorithm.
