With the rapid growth of data scale and the diversification of demand, there is a pressing need to extract useful frequent itemsets from datasets of different scales. Traditional methods can certainly solve this problem, but they do not fully exploit the relationships among datasets of different scales. This paper proposes a fast approach: provided that the frequent itemsets of the small-scale datasets have already been mined, the frequent itemsets of the large-scale dataset are inferred directly from them instead of being mined again from the large-scale dataset. We conduct extensive experiments on one synthetic dataset and four UCI datasets. The experimental results show that our algorithm is significantly faster and consumes less memory than leading algorithms. INDEX TERMS Up-scaling, up-scaling frequent itemsets, frequent itemset mining, data mining.
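The idea of inferring large-scale frequent itemsets from small-scale results can be illustrated with a minimal sketch. This is not the algorithm of the paper, whose details are not given in the abstract; it only uses the standard partition property that an itemset frequent in the union of several datasets must be frequent in at least one of them. The function name `infer_upscaled_itemsets`, the input layout, and the split into "certain" and "uncertain" itemsets are assumptions made for illustration.

```python
from math import ceil


def infer_upscaled_itemsets(parts, min_support):
    """Bound the support of itemsets in the union of small-scale datasets.

    parts: list of (n_transactions, {frozenset_of_items: support_count}),
           one entry per small-scale dataset (hypothetical layout), where
           each dict holds the itemsets already mined as frequent there.
    min_support: relative support threshold in (0, 1].
    Returns (certain, uncertain): itemset -> (lower_count, upper_count).
    """
    total = sum(n for n, _ in parts)
    threshold = min_support * total

    # Partition property: only locally frequent itemsets can be frequent
    # in the union, so their union is a complete candidate set.
    candidates = set()
    for _, freq in parts:
        candidates.update(freq)

    certain, uncertain = {}, {}
    for itemset in candidates:
        low = high = 0
        for n, freq in parts:
            if itemset in freq:
                # Exact count is known where the itemset was frequent.
                low += freq[itemset]
                high += freq[itemset]
            else:
                # Not frequent locally: count is strictly below
                # min_support * n, so bound it from above.
                high += max(ceil(min_support * n) - 1, 0)
        if low >= threshold:
            certain[itemset] = (low, high)
        elif high >= threshold:
            uncertain[itemset] = (low, high)
        # Otherwise the itemset cannot be frequent in the union.
    return certain, uncertain
```

Itemsets in `certain` are frequent at the large scale without touching the raw data. Itemsets in `uncertain` would still need a check against the small-scale datasets where their counts are unknown.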
Multiscale analysis helps people observe objects or problems from different perspectives. Multiscale clustering has been widely studied in various disciplines. However, most studies address only numerical datasets; there is little research on clustering nominal datasets, especially when the data are not independent and identically distributed (Non-IID). To address this gap, this paper proposes a multiscale clustering framework for Non-IID nominal data. First, the benchmark-scale dataset is clustered using the coupled metric similarity measure. Second, the clustering results are transformed from the benchmark scale to the target scale by two algorithms: upscaling based on single chain and downscaling based on Lanczos kernel. Finally, experiments are performed on five public datasets and one real dataset from Hebei province, China. The results show that the method not only provides competitive performance but also reduces computational cost.
Multiscale analysis helps people observe objects or problems from different perspectives, and clustering on multiscale data has practical significance. At present, there is little research on clustering large-scale data when the clustering results of small-scale datasets have already been obtained. Clustering large-scale datasets with traditional methods has two disadvantages: (1) the clustering results of the small-scale datasets are not utilized; (2) the traditional methods incur more running overhead. To address these shortcomings, this paper proposes a multiscale clustering framework based on DBSCAN. The framework uses DBSCAN to cluster the small-scale datasets and then introduces the algorithm Scaling-Up Cluster Centers (SUCC), which generates the cluster centers of large-scale datasets by merging the clustering results of small-scale datasets rather than mining the raw large-scale datasets. We show experimentally that, compared to the traditional algorithm DBSCAN and the leading algorithms DBSCAN++ and HDBSCAN, SUCC not only provides competitive performance but also reduces computational cost. In addition, with expert guidance, SUCC becomes more competitive in accuracy.
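The merging step can be illustrated with a minimal sketch. The abstract does not specify how SUCC merges small-scale results, so the following is an assumed stand-in, not the authors' algorithm: each small-scale cluster is summarized by a center (for example, the mean of its points) and a size, and centers closer than a chosen `radius` are merged greedily into a size-weighted mean. The function name `merge_centers` and the `radius` parameter are hypothetical.

```python
from math import dist


def merge_centers(centers, radius):
    """Greedily merge small-scale cluster centers into large-scale ones.

    centers: list of (point, size), where point is a tuple of floats and
             size is the number of points in that small-scale cluster.
    radius:  centers within this Euclidean distance are merged
             (assumed parameter, not taken from the paper).
    Returns a list of (point, size) for the merged centers.
    """
    merged = []  # each entry is a mutable [point, size] pair
    for point, size in centers:
        for entry in merged:
            if dist(entry[0], point) <= radius:
                total = entry[1] + size
                # Size-weighted mean keeps the merged center at the
                # centroid of all points it represents.
                entry[0] = tuple(
                    (a * entry[1] + b * size) / total
                    for a, b in zip(entry[0], point)
                )
                entry[1] = total
                break
        else:
            # No existing center is close enough: start a new one.
            merged.append([tuple(point), size])
    return [(p, s) for p, s in merged]
```

Because the merge is greedy, the result can depend on the order of the input centers. The sketch only shows that large-scale centers can be obtained from cluster summaries without revisiting the raw data.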