Data mining in distributed environment: a survey

Gan, Wensheng; Lin, Jerry Chun‐Wei; Chao, Han-Chieh; Zhan, Justin

doi:10.1002/widm.1216

Cited by 121 publications

(57 citation statements)

References 135 publications

(278 reference statements)

“…Pattern (i.e., itemset, rule, and sequence) mining [20,33] is a kind of well-studied data mining and analytics model. The applications of pattern mining models are very extensive, and details can be referred to in the survey literature [11,15,18,19]. A great effort has been put forth by the data mining community to discover frequent patterns from itemset-based data, such as Apriori [3] and FP-growth [20] methods.…”

Section: Frequency-based Mining On Sequences (mentioning)

confidence: 99%

“…Knowing the useful patterns and auxiliary knowledge from sequences/events can benefit a number of applications, such as web access analysis, event prediction, time-aware recommendation, and DNA detection [11]. Up to now, research has been conducted on mining interesting patterns from transaction or sequential data [11,15,20,33]. However, most of them are based on the co-occurrence frequency of patterns.…”

Section: Introduction (mentioning)

confidence: 99%

ProUM: Projection-based utility mining on sequence data

Gan, Lin, Zhang, et al. 2020

Information Sciences

Self Cite

Utility is an important concept in economics. A variety of applications consider utility in real-life situations, which has led to the emergence of utility-oriented mining (also called utility mining) in the recent decade. Utility mining has attracted a great amount of attention, but most of the existing studies have been developed to deal with itemset-based data. Time-ordered sequence data is more commonly seen in real-world situations, which is different from itemset-based data. Since they are time-consuming and require a large amount of memory, current utility mining algorithms still have limitations when dealing with sequence data. In addition, the mining efficiency of utility mining on sequence data still needs to be improved, especially for long sequences or when there is a low minimum utility threshold. In this paper, we propose an efficient Projection-based Utility Mining (ProUM) approach to discover high-utility sequential patterns from sequence data. The utility-array structure is designed to store the necessary information of the sequence-order and utility. ProUM can significantly improve the mining efficiency by utilizing the projection technique in generating the utility-array, and it effectively reduces the memory consumption. Furthermore, a new upper bound named sequence extension utility is proposed and several pruning strategies are further applied to improve the efficiency of ProUM. By taking utility theory into account, the derived high-utility sequential patterns have more insightful and interesting information than other kinds of patterns. Experimental results showed that the proposed ProUM algorithm significantly outperformed the state-of-the-art algorithms in terms of execution time, memory usage, and scalability.

“…There are some research opportunities for iHUIM to handle large‐scale databases: how to design a parallelized iHUIM algorithm, how to develop an iHUIM algorithm based on the existing big data technologies (e.g., MapReduce (Dean & Ghemawat, ), Spark (Zaharia, Chowdhury, Das, Dave, & Ma, )). Besides, other promising areas can be considered such as designing parallel, distributed, multicore, and GPU‐based algorithms (Gan, Lin, Chao, & Zhan, ) for iHUIM.…”

Section: Opportunities For iHUIM (mentioning)

confidence: 99%

A survey of incremental high‐utility itemset mining

Gan, Lin, Fournier-Viger, et al. 2018

WIREs Data Min & Knowl

Self Cite

Traditional association rule mining has been widely studied. But it is unsuitable for real‐world applications where factors such as unit profits of items and purchase quantities must be considered. High‐utility itemset mining (HUIM) is designed to find highly profitable patterns by considering both the purchase quantities and unit profits of items. However, most HUIM algorithms are designed to be applied to static databases. But in real‐world applications such as market basket analysis and business decision‐making, databases are often dynamically updated by inserting new data such as customer transactions. Several researchers have proposed algorithms to discover high‐utility itemsets (HUIs) in dynamically updated databases. Unlike batch algorithms, which always process a database from scratch, incremental high‐utility itemset mining (iHUIM) algorithms incrementally update and output HUIs, thus reducing the cost of discovering HUIs. This paper provides an up‐to‐date survey of the state‐of‐the‐art iHUIM algorithms, including Apriori‐based, tree‐based, and utility‐list‐based approaches. To the best of our knowledge, this is the first survey on the mining task of incremental high‐utility itemset mining. The paper also identifies several important issues and research challenges for iHUIM. WIREs Data Mining Knowl Discov 2018, 8:e1242. doi: 10.1002/widm.1242. This article is categorized under: Algorithmic Development > Association Rules; Application Areas > Data Mining Software Tools; Fundamental Concepts of Data and Knowledge > Knowledge Representation.

“…By enhancement of the intermediate result storage with in-memory computations and generalization of the MapReduce pattern with a more flexible directed acyclic graph (DAG), SPARK has gained a high popularity (Landset, Khoshgoftaar, Richter, & Hasanin, 2015). Especially the in-memory processing of big data is a key technology for fast and responsive mining and can be found in many commercial products like SAP HANA or SAS products (Gan, Lin, Chao, & Zhan, 2017). For machine learning SPARK offers its own machine learning library MLlib but the open source aspect also allows the development of third party frameworks like H2O.…”

Section: Historical Development and State-of-the-art (mentioning)

confidence: 99%

Data mining tools

Bartschat, Reischl, Mikut 2019

WIREs Data Min & Knowl

The development and application of data mining algorithms require the use of powerful software tools. With challenges such as big data encountered in the economy or gene sequencing for life science, data mining is important for daily problems as well as specialized fields. However, the large variety of requirements and user groups leads to a huge number and diversity of software tools. We give an overview by discussing the historical development and presenting a range of existing state‐of‐the‐art data mining and related tools. This paper is an update of our previous article from 2011 following the encyclopedic aspect of Wiley Interdisciplinary Reviews to include new findings or references and changing outdated information. However, since the paper should be able to stand alone, it includes many still valid elements of the previous article. Following the original paper, we propose criteria for the tool categorization based on different user groups, data structures, data mining tasks and methods, visualization and interaction styles, import and export options for data and models, platforms, and license policies. These criteria are then used to classify data mining tools into nine different categories. The typical characteristics of these types are explained and a selection of the most important tools is categorized. This article is categorized under: Application Areas > Data Mining Software Tools
