Frequent Itemsets Mining for Big Data: A Comparative Analysis

Apiletti, Daniele; Baralis, Elena; Cerquitelli, Tania; Garza, Paolo; Pulvirenti, Fabio; Venturini, Luca

doi:10.1016/j.bdr.2017.06.006

Cited by 43 publications

(33 citation statements)

References 26 publications

“…Highlighted itemsets are the un-pruned candidates generated in optimized phases. It can be seen that C_4 ⊂ C'_4 and C_5 ⊂ C'_5. When both types of phases count the support for C_3, C_4, C_5 or C_3, C'_4, C'_5 and check against min_sup, the same set of frequent itemsets are generated at the end of phases.…”

Section: Pass K+2 (mentioning)

confidence: 99%

“…Candidate 4 and 5-itemsets are different and distinguished as C_4 & C'_4 and C_5 & C'_5 for simple phase and optimized phase respectively. Simple phase uses apriori-gen() to generate C_4 and C_5 while optimized phase uses non-apriori-gen() to generate C'_4 and C'_5. No more candidate generation is possible further so, both stop here.…”

Section: Pass K+2 (mentioning)

confidence: 99%
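
The relationship between the pruned and un-pruned candidate sets quoted above can be illustrated with a small sketch. It assumes the usual Apriori join (merge itemsets sharing all but their last item) and treats non-apriori-gen() as the same join with the subset-pruning step skipped; the function bodies and the toy candidate set C_3 are illustrative assumptions, not the cited authors' code or data.

```python
from itertools import combinations

def join_step(itemsets):
    """Join k-itemsets that share their first k-1 items into (k+1)-candidates."""
    ordered = sorted(tuple(sorted(s)) for s in itemsets)
    candidates = set()
    for a, b in combinations(ordered, 2):
        if a[:-1] == b[:-1]:
            candidates.add(a + (b[-1],))
    return candidates

def apriori_gen(itemsets):
    """Join, then prune every candidate that has a k-subset missing from the input."""
    known = {tuple(sorted(s)) for s in itemsets}
    return {c for c in join_step(itemsets)
            if all(sub in known for sub in combinations(c, len(c) - 1))}

def non_apriori_gen(itemsets):
    """Join only: the pruning step is skipped (assumed reading of the quote)."""
    return join_step(itemsets)

# Hypothetical candidate 3-itemsets, chosen only to make the difference visible.
c3 = {("A", "B", "C"), ("A", "B", "D"), ("A", "B", "E"),
      ("A", "C", "D"), ("A", "C", "E"), ("B", "C", "D")}

c4 = apriori_gen(c3)                 # simple phase
c5 = apriori_gen(c4)
c4_prime = non_apriori_gen(c3)       # optimized phase
c5_prime = non_apriori_gen(c4_prime)

print(sorted(c4), sorted(c4_prime))
print(sorted(c5), sorted(c5_prime))
```

On this toy input the pruned sets are strict subsets of the un-pruned ones at both levels. Any extra un-pruned candidate has an infrequent subset, so it cannot reach min_sup once supports are counted, which is why both kinds of phase end with the same frequent itemsets.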

“…MapReduce is a parallel programming model of Hadoop designed for parallel processing of large volumes of data. Therefore, it is required to redesign the data mining algorithms on MapReduce framework in order to mine big data sets [5]. In MapReduce programming model, an application is called a MapReduce Job which consists of Mapper and Reducer and input datasets are stored in Hadoop Distributed File System (HDFS).…”

Section: Introduction (mentioning)

confidence: 99%
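
The Mapper/Reducer split described in the statement above can be sketched in a few lines. This is an in-process Python simulation for illustration only: it does not use the Hadoop API or HDFS, and the names mapper, reducer, run_job and the toy transactions are assumptions.

```python
from collections import defaultdict

def mapper(transaction):
    """Emit (item, 1) for every distinct item in one transaction."""
    for item in set(transaction):
        yield item, 1

def reducer(key, values):
    """Sum the partial counts collected for one key."""
    return key, sum(values)

def run_job(transactions, min_sup):
    """Simulate one MapReduce job: map, shuffle by key, reduce, then filter."""
    groups = defaultdict(list)
    for transaction in transactions:          # map phase
        for key, value in mapper(transaction):
            groups[key].append(value)         # shuffle: group values by key
    counts = dict(reducer(k, vs) for k, vs in groups.items())  # reduce phase
    return {item: c for item, c in counts.items() if c >= min_sup}

# Hypothetical transactional dataset.
transactions = [["A", "B", "C"], ["A", "C"], ["B", "D"], ["A", "B", "C", "D"]]
frequent_items = run_job(transactions, min_sup=3)
print(frequent_items)
```

In a real cluster the map calls run in parallel on separate input splits and the framework performs the shuffle; the single loop here only mimics that data flow.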

Performance optimization of MapReduce-based Apriori algorithm on Hadoop cluster

Singh

Garg

Mishra

2018

Computers & Electrical Engineering

Many techniques have been proposed to implement the Apriori algorithm on MapReduce framework but only a few have focused on performance improvement. FPC (Fixed Passes Combined-counting) and DPC (Dynamic Passes Combined-counting) algorithms combine multiple passes of Apriori in a single MapReduce phase to reduce the execution time. In this paper, we propose improved MapReduce based Apriori algorithms VFPC (Variable Size based Fixed Passes Combined-counting) and ETDPC (Elapsed Time based Dynamic Passes Combined-counting) over FPC and DPC. Further, we optimize the multi-pass phases of these algorithms by skipping pruning step in some passes, and propose Optimized-VFPC and Optimized-ETDPC algorithms. Quantitative analysis reveals that counting cost of additional un-pruned candidates produced due to skipped-pruning is less significant than reduction in computation cost due to the same. Experimental results show that VFPC and ETDPC are more robust and flexible than FPC and DPC whereas their optimized versions are more efficient in terms of execution time.

“…This is the frequency array for itemset {A, D}. Thus, the rule A, D ⇒ + has confidence 1 and support 0.5, and satisfies the minimum thresholds [2]. Rule A ⇒ + is not generated, as one of the subpatterns of A has already produced one rule.…”

Section: Algorithm 2: Cap-growth (mentioning)

confidence: 99%
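
The support and confidence figures in the statement above follow from the per-class frequency array of the itemset. The sketch below reproduces that arithmetic on a hypothetical four-record dataset. The records, the function names, and the definition of rule support as the fraction of all records matching both the itemset and the class are assumptions for illustration; the cited paper's own data is not shown on this page.

```python
from collections import Counter

def frequency_array(records, itemset):
    """Count, per class label, the records that contain every item of the itemset."""
    return Counter(label for items, label in records if itemset <= items)

def rule_metrics(records, itemset, label):
    """Return (support, confidence) of the rule itemset => label."""
    freq = frequency_array(records, itemset)
    covered = sum(freq.values())              # records matching the itemset
    support = freq[label] / len(records)
    confidence = freq[label] / covered if covered else 0.0
    return support, confidence

# Hypothetical labelled records: (set of items, class label).
records = [
    ({"A", "D"}, "+"),
    ({"A", "D"}, "+"),
    ({"A", "B"}, "-"),
    ({"B", "C"}, "-"),
]

print(frequency_array(records, {"A", "D"}))
print(rule_metrics(records, {"A", "D"}, "+"))
```

With these records {A, D} occurs twice, both times with class +, so the rule A, D ⇒ + has support 2/4 = 0.5 and confidence 2/2 = 1.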

Scaling associative classification for very large datasets

2017

Self Cite

Venturini et al. J Big Data (2017) 4:44

Abstract

Supervised learning algorithms are nowadays successfully scaling up to datasets that are very large in volume, leveraging the potential of in-memory cluster-computing Big Data frameworks. Still, massive datasets with a number of large-domain categorical features are a difficult challenge for any classifier. Most off-the-shelf solutions cannot cope with this problem. In this work we introduce DAC, a Distributed Associative Classifier. DAC exploits ensemble learning to distribute the training of an associative classifier among parallel workers and improve the final quality of the model. Furthermore, it adopts several novel techniques to reach high scalability without sacrificing quality, among which a preventive pruning of classification rules in the extraction phase based on Gini impurity. We ran experiments on Apache Spark, on a real large-scale dataset with more than 4 billion records and 800 million distinct categories. The results showed that DAC improves on a state-of-the-art solution in both prediction quality and execution time. Since the generated model is human-readable, it can not only classify new records, but also allow understanding both the logic behind the prediction and the properties of the model, becoming a useful aid for decision makers.

Introduction

In the recent years, Big Data have received much attention by both the academic and the industrial world, with the aim of fully leveraging the power of the information they hide. The dimensions on which very large datasets usually extend are mainly the size, i.e. the disk storage occupied, the volume, i.e. the number of records, the dimensionality, i.e. the number of features a record can have, and the domain, i.e. the number of distinct values a feature can take. A special effort has been dedicated to Machine learning algorithms, with a profusion of solutions to tackle the scalability problem, on some or all of the dimensions mentioned above.

Scalability on the domain dimension is a special concern for the datasets in which most of the features are categorical. Categorical features have their values expressed in a discrete domain, and no concept of ordering or ranking can be assumed. Discrete or discretized features are a special case of categorical features where an order among the values is defined. The absence of a natural ordering increases the complexity of the treatment of categorical variables, as their values cannot be binned in groups or levels for example.

Associative classifiers are a special category of Machine learning algorithms, where association rule mining is exploited for the purpose of classification. In the past, they have proved to be able to produce classification models of high quality and outperform state-of-art algorithms like decision trees [1]. Moreover, the model produced is readable, as it is made of association rules, can be debugged and even manually tuned if needed, by modifying or deleting specific rules. In a world ...

“…Mining frequent itemsets from transactional databases play an important role in many data mining applications, e.g., social network mining (Jiang, Leung, & Zhang, 2016; Moosavi, Jalali, Misaghian, Shamshirband, & Anisi, 2017), finding gene expression patterns (Becquet, Blachon, Jeudy, Boulicaut, & Gandrillon, 2001; Creighton & Hanash, 2003; Cremaschi et al, 2015; Mallik, Mukhopadhyay, & Maulik, 2015), web log pattern mining (Diwakar Tripathia & Edlaa, 2017; Han, Cheng, Xin, & Yan, 2007; Iváncsy, Renáta, & Vajk, 2006; Yu & Korkmaz, 2015). In recent years, many algorithms have been proposed for efficient mining of frequent itemsets (Apiletti et al, 2017; Bodon, 2003; Burdick, Calimlim, Flannick, Gehrke, & Yiu, 2005; Gan, Lin, Fournier-Viger, Chao, & Zhan, 2017; Han, Pei, & Yin, 2000; Kosters & Pijls, 2003; Liu, Lu, Yu, Wang, & Xiao, 2003; Pei, Tung, & Han, 2001; Uno, Kiyomi, & Arimura, 2004; Vo, Pham, Le, & Deng, 2017). These algorithms take a transactional database and support threshold (minimum itemset support) as input and mines complete set of frequent itemsets with support greater than minimum itemset support.…”

Section: Introduction (mentioning)

confidence: 99%

An efficient pattern growth approach for mining fault tolerant frequent itemsets

Bashir

2020

Expert Systems with Applications

Abstract

Mining fault tolerant (FT) frequent itemsets from transactional databases are computationally more expensive than mining exact matching frequent itemsets. Previous algorithms mine FT frequent itemsets using Apriori heuristic. Apriori-like algorithms generate exponential number of candidate itemsets including the itemsets that do not exist in the database. These algorithms require multiple scans of database for counting the support of candidate FT itemsets. In this paper we present a novel algorithm, which mines FT frequent itemsets using frequent pattern growth approach (FT-PatternGrowth). FT-PatternGrowth adopts a divide-and-conquer technique and recursively projects transactional database into a set of smaller projected transactional databases and mines FT frequent itemsets in each projected database by exploring only locally frequent items. This mines the complete set of FT frequent itemsets and substantially reduces those candidate itemsets that do not exist in the database. FT-PatternGrowth stores the transactional database in a highly condensed much smaller data structure called frequent pattern tree (FP-tree). The support of candidate itemsets are counted directly from the FP-tree without scanning the original database multiple times. This improves the processing speed of algorithm. Our experiments on benchmark databases indicates mining FT frequent itemsets using FT-PatternGrowth is highly efficient than Apriori-like algorithms.
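
The idea of counting support from an FP-tree instead of rescanning the database can be sketched as follows. This is a plain, exact-match FP-tree, not the fault-tolerant FT-PatternGrowth algorithm from the abstract; the class and function names and the toy transactions are assumptions made for illustration.

```python
from collections import Counter, defaultdict

class Node:
    """One FP-tree node: an item, its count on this path, and tree links."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    counts = Counter(i for t in transactions for i in set(t))
    # Rank frequent items by descending count (ties broken alphabetically).
    rank = {i: r for r, (i, c) in enumerate(
        sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))) if c >= min_sup}
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        node = root
        # Insert the frequent items of the transaction in rank order.
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)
            child.count += 1
            node = child
    return header, rank

def support(itemset, header, rank):
    """Count a non-empty itemset's support from the tree alone."""
    if any(i not in rank for i in itemset):
        return 0  # infrequent items are not stored in the tree
    *others, last = sorted(itemset, key=rank.get)
    total = 0
    for node in header[last]:
        path, p = set(), node.parent
        while p is not None and p.item is not None:  # walk up to the root
            path.add(p.item)
            p = p.parent
        if set(others) <= path:
            total += node.count
    return total

# Hypothetical transactional database.
transactions = [["A", "B", "C"], ["A", "C"], ["B", "D"], ["A", "B", "C", "D"]]
header, rank = build_fp_tree(transactions, min_sup=2)
print(support({"A", "C"}, header, rank), support({"B", "D"}, header, rank))
```

The database is read once to count items and once to build the tree; every later support query walks only the nodes of the itemset's lowest-ranked item and their ancestor paths.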
