Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

Domingo, Carlos; Gavaldà, Ricard; Watanabe, Osamu

doi:10.1007/3-540-46846-3_16

Cited by 57 publications

(57 citation statements)

References 12 publications


“…However, they tested this method with only one candidate itemset. Thus, as suggested in [8,9], more works need to be done to combine this adaptive sampling method with some mining algorithms, such as Apriori, and more experiments are required to test the cost of on-line sampling.…”

Section: Sequential Sequence Mining Algorithms (mentioning)

confidence: 99%

“…MSPX can avoid or alleviate some problems inherent in the traditional single-sample methods [5,8,9,12,17,20,21]: (1) The performance of single-sample methods usually varies considerably from one run to another for the same mining task, because a bad sample can degrade the overall performance of the mining. On the other hand, by using multiple samples, MSPX effectively prevents the candidate generation from the overestimates made by a bad sample, so its performance is much more stable.…”

Section: Sampling in MSPX (mentioning)

confidence: 99%

“…In [8,9], they proposed how to dynamically determine the sample size based on the estimated support of the candidate itemsets, which is expected to be tighter than the sample size based on the Chernoff boundary. The basic idea is that a relatively small sample can be used if the support of a candidate itemset is far from the minimum support.…”

Section: Sequential Sequence Mining Algorithms (mentioning)

confidence: 99%
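
The idea in the statement above, that fewer samples are needed when an itemset's support is far from the minimum support, can be illustrated with a minimal sketch. This is not the algorithm from the cited paper: the function names, the Hoeffding-style bound, and the union-bound confidence radius are all assumptions chosen for illustration. The fixed-size rule must commit to a tolerance `epsilon` up front, while the sequential test stops as soon as the running estimate is confidently on one side of the threshold.

```python
import math
import random


def fixed_sample_size(epsilon, delta):
    """Hoeffding-style sample size: with probability >= 1 - delta the
    estimated support is within epsilon of the true support.
    (Illustrative bound, not taken from the cited paper.)"""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def adaptive_support_test(draw, min_support, delta, max_samples):
    """Sequentially sample until the running support estimate is confidently
    above or below min_support, or until max_samples is reached.

    draw() is a hypothetical callback that returns True when a randomly
    drawn transaction contains the candidate itemset.
    Returns (is_frequent, samples_used)."""
    hits = 0
    for t in range(1, max_samples + 1):
        hits += 1 if draw() else 0
        estimate = hits / t
        # Confidence radius after t samples; the t * (t + 1) factor spreads
        # the failure probability delta over all possible stopping times.
        radius = math.sqrt(math.log(2.0 * t * (t + 1) / delta) / (2.0 * t))
        if abs(estimate - min_support) > radius:
            return estimate > min_support, t
    # Budget exhausted: fall back to the plain point estimate.
    return hits / max_samples >= min_support, max_samples


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated itemset with true support 0.30, tested against minimum support 0.05.
    result = adaptive_support_test(lambda: rng.random() < 0.30, 0.05, 0.05, 100000)
    print("fixed size for epsilon=0.01:", fixed_sample_size(0.01, 0.05))
    print("adaptive (is_frequent, samples_used):", result)
```

When the true support sits close to the threshold, the sequential test can run up to `max_samples`, so the saving depends entirely on how large the gap is.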

“…Sampling has been used for mining frequent itemsets or sequences [5,8,9,12,17,20,21]. In [5], the FAST algorithm progressively refines the initial sample to obtain a small final sample and reports the set of frequent itemsets in the final sample as the result.…”

Section: Sequential Sequence Mining Algorithms (mentioning)

confidence: 99%


Parallel mining of maximal sequential patterns using multiple samples

Luo, Chung (2010). J Supercomput


In this paper, we propose a new parallel algorithm, named PMSPX, which mines maximal frequent sequences by using multiple samples to exclude infrequent candidates effectively. A frequent sequence is maximal if none of its supersequences is frequent. Unlike the traditional single-sample methods developed for mining frequent itemsets, PMSPX uses multiple samples. Thus, it can avoid or alleviate some problems inherent in the single-sample methods. We theoretically analyzed how to increase the minimum support level to prevent misestimating infrequent candidates as frequent in the mining of samples. PMSPX is a parallel version of our sequential MSPX algorithm, and it is developed on a cluster of workstations. In PMSPX, each processing node uses MSPX to find a candidate set of local maximal frequent sequences first, independently from other processing nodes. Then, a top-down search is performed, starting with all the candidates, in a synchronous manner to identify real maximal frequent sequences. This asynchronous local mining followed by synchronous global mining approach minimizes the synchronization and communication among the processing nodes. Three database partitioning methods are proposed to distribute the database across the processing nodes, so that their workloads are balanced and the data skewness of the whole database is preserved in the data partition of each node. A comprehensive analysis was performed on PMSPX and existing parallel sequence mining algorithms, and extensive experiments were conducted on PMSPX. PMSPX demonstrates very good speedup and scaleup properties. It also requires less communication and synchronization than other parallel algorithms.


“…A variety of procedures for selecting subsets from a large dataset are studied in [6]; and the results of using different techniques are empirically compared. A sequential sampling method for determining appropriate sample sizes for data reduction is proposed in [7]. The afore-mentioned data reduction approaches are mainly based on statistical sampling techniques, such as simple random sampling, stratified sampling or cluster sampling.…”

Section: Data Reduction: An Overview (mentioning)

confidence: 99%

Adaptive data reduction for large-scale transaction data

Jacob (2008). European Journal of Operational Research


Overview on evolutionary subgroup discovery: analysis of the suitability and potential of the search performed by evolutionary algorithms

Carmona, González, Jesus, et al. (2014). WIREs Data Min & Knowl


Subgroup discovery (SD) is a descriptive data mining technique using supervised learning. In this article, we review the use of evolutionary algorithms (EAs) for SD. In particular, we will focus on the suitability and potential of the search performed by EAs in the development of SD algorithms. Future directions in the use of EAs for SD are also presented in order to show the advantages and benefits that this search strategy contributes to this task.
