Dongmei Ren scite author profile

et al. 2003

Online Analytical Processing (OLAP) is an important application of data warehouses. With more and more spatial data being collected, such as remotely sensed images, geographical information, and digital sky survey data, efficient OLAP for spatial data is in great demand. In this paper, we build a new data warehouse structure, the PD-cube. With the PD-cube, OLAP operations and queries can be implemented efficiently. All of this is accomplished using the fast logical operations of Peano Trees (P-trees). One P-tree variation, the Predicate P-tree, is used to reduce data accesses by filtering out "bit holes" consisting of consecutive 0's. Experiments show that OLAP operations can be executed much faster than with traditional OLAP methods.
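The idea of answering a query with logical operations on vertical bit columns can be illustrated with a small sketch. This is not the PD-cube or a P-tree: it uses plain, uncompressed Python integers as bitmaps, and the names `bit_slices` and `rows_equal` are hypothetical helpers chosen for this example.

```python
def bit_slices(values, width):
    """Split a column of non-negative ints into `width` vertical bit slices.

    Slice k is an int whose bit r is set when bit k of values[r] is 1.
    (Uncompressed stand-in for a vertical structure; not a P-tree.)
    """
    slices = [0] * width
    for r, v in enumerate(values):
        for k in range(width):
            if (v >> k) & 1:
                slices[k] |= 1 << r
    return slices


def rows_equal(slices, n_rows, target):
    """Bitmap of rows whose value equals `target`, using only AND/NOT.

    Only the low len(slices) bits of `target` are compared.
    """
    all_rows = (1 << n_rows) - 1
    mask = all_rows
    for k, s in enumerate(slices):
        # Keep rows whose bit k matches bit k of the target.
        mask &= s if (target >> k) & 1 else (all_rows & ~s)
    return mask


values = [5, 3, 5, 7, 0]
slices = bit_slices(values, width=3)
mask = rows_equal(slices, len(values), target=5)
count = bin(mask).count("1")  # number of matching rows
```

A count such as `count` is the kind of aggregate an OLAP slice would return; the abstract's approach additionally compresses the bit columns and skips runs of zeros, which this sketch does not attempt.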

A vertical distance-based outlier detection method with local pruning

Rahal

Perrizo

et al. 2004

"One person's noise is another person's signal." Outlier detection is used to clean up datasets and also to discover useful anomalies, such as criminal activities in electronic commerce, computer intrusion attacks, terrorist threats, and agricultural pest infestations. Thus, outlier detection is critically important in an information-based society. This paper focuses on finding outliers in large datasets using distance-based methods. First, to speed up outlier detection, we revise Knorr and Ng's distance-based outlier definition; second, a vertical data structure, instead of traditional horizontal structures, is adopted to make outlier detection more efficient still. We tested our methods on a National Hockey League dataset and show an order-of-magnitude speed improvement over contemporary distance-based outlier detection approaches.
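For orientation, the sketch below shows the original Knorr and Ng definition as it is commonly stated: a point is a DB(p, D)-outlier if at least a fraction p of the dataset lies farther than distance D from it. It is a brute-force illustration only. It does not implement the revised definition, the local pruning, or the vertical structure described in the abstract, and the function name `db_outliers` is an assumption made for this example.

```python
from math import dist


def db_outliers(points, p, d):
    """Return indices of DB(p, d)-outliers by brute force, O(n^2) distances."""
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        # Count the other points lying strictly farther than d from point i.
        far = sum(1 for j, b in enumerate(points) if j != i and dist(a, b) > d)
        if far >= p * n:
            outliers.append(i)
    return outliers


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
found = db_outliers(points, p=0.75, d=3.0)
```

The quadratic cost of this naive loop is what faster distance-based methods aim to reduce.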

A vertical outlier detection algorithm with clusters as by-product

Rahal

Perrizo

Outlier detection can lead to the discovery of unexpected and interesting knowledge, which is critically important in areas such as monitoring criminal activity in electronic commerce and detecting credit card fraud. In this paper, we propose an outlier detection method that produces clusters as a by-product and works efficiently on large datasets. Our contributions are: a) We introduce a Local Connective Factor (LCF); b) Based on LCF, we propose an outlier detection method that can efficiently detect outliers and group data into clusters in a single pass. Our method does not require a prior clustering step, which is the first step in other state-of-the-art clustering-based outlier detection methods; c) The performance of our method is further improved by means of a vertical data representation, P-trees. We tested our method on a real dataset. It shows roughly a fivefold speed improvement over other contemporary clustering-based outlier detection approaches.

A Scalable Vertical Model for Mining Association Rules

Rahal

Perrizo

2004

J. Info. Know. Mgmt.

Association rule mining (ARM) is the data mining process of finding all association rules in a dataset that match user-defined measures of interest such as support and confidence. Usually, ARM proceeds by mining all frequent itemsets (a step known to be very computationally intensive), from which rules are then derived in a straightforward manner. In general, frequent itemset mining prunes the search space by using the downward closure (or anti-monotonicity) property of support, which states that no itemset can be frequent unless all of its subsets are frequent. A large number of papers have addressed the problem of ARM, but not many of them have focused on scalability over very large datasets (i.e., datasets containing a very large number of transactions). In this paper, we propose a new model for representing data and mining frequent itemsets that is based on P-tree technology, for compression and faster logical operations over vertically structured data, and on set enumeration trees, for fast itemset enumeration. Experimental results show large improvements for our approach on large datasets when compared to other contemporary approaches in the literature.
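The two ideas named in the abstract, downward-closure pruning and support counting by logical operations on vertical data, can be sketched together as follows. This is a generic level-wise search over uncompressed integer bitmaps, not the paper's P-tree model or its set enumeration tree; `vertical_bitmaps` and `frequent_itemsets` are hypothetical names, and `min_support` is taken to be an absolute transaction count.

```python
from itertools import combinations


def vertical_bitmaps(transactions):
    """Map each item to an int whose bit t is set if transaction t contains it."""
    bitmaps = {}
    for t, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << t)
    return bitmaps


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    bitmaps = vertical_bitmaps(transactions)
    # Level 1: frequent single items.
    level = {frozenset([i]): b for i, b in bitmaps.items()
             if bin(b).count("1") >= min_support}
    result = {}
    while level:
        result.update({s: bin(b).count("1") for s, b in level.items()})
        next_level = {}
        for (a, bits_a), (b, bits_b) in combinations(level.items(), 2):
            candidate = a | b
            if len(candidate) != len(a) + 1 or candidate in next_level:
                continue
            # Downward closure: every subset one item smaller must be frequent.
            if any(candidate - {x} not in level for x in candidate):
                continue
            # AND of the two bitmaps marks transactions containing every item.
            bits = bits_a & bits_b
            if bin(bits).count("1") >= min_support:
                next_level[candidate] = bits
        level = next_level
    return result


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
itemsets = frequent_itemsets(transactions, min_support=3)
```

Each support count here costs one AND plus a bit count, with no rescan of the transactions, which is the property a vertical layout is meant to provide.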

Efficient Density Clustering Method for Spatial Data

Pan

Wang

Zhang

et al. 2003

Data mining for spatial data has become increasingly important as more and more organizations are exposed to spatial data from sources such as remote sensing, geographical information systems, astronomy, computer cartography, and environmental assessment and planning. Recently, density-based clustering methods such as DENCLUE, DBSCAN, and OPTICS have been published and recognized as powerful clustering methods for data mining. These approaches have a run time complexity of O(n log n) when using spatial index techniques such as the R+ tree and grid cells. However, these methods are known to lack scalability with respect to dimensionality. In this paper, a unique approach to efficient neighborhood search and a new efficient density-based clustering algorithm using EIN-rings are developed. Our approach exploits compressed vertical data structures, Peano Trees (P-trees), and fast P-tree logical operations to accelerate the calculation of the density function within EIN-rings. This approach stands in contrast to the ubiquitous approach of vertically scanning horizontal data structures (records). The average run time complexity of our algo-. Our proposed method has cardinality scalability comparable to other density methods for small and medium-sized data, but superior speed and dimensional scalability.