Abstract. Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
There has been increasing interest in the problem of building accurate data mining models over aggregate data, while protecting privacy at the level of individual records. One approach for this problem is to randomize the values in individual records, and only disclose the randomized values. The model is then built over the randomized data, after first compensating for the randomization (at the aggregate level). This approach is potentially vulnerable to privacy breaches: based on the distribution of the data, one may be able to learn with high confidence that some of the randomized records satisfy a specified property, even though privacy is preserved on average.In this paper, we present a new formulation of privacy breaches, together with a methodology, "amplification", for limiting them. Unlike earlier approaches, amplification makes it is possible to guarantee limits on privacy breaches without any knowledge of the distribution of the original data. We instantiate this methodology for the problem of mining association rules, and modify the algorithm from  to limit privacy breaches without knowledge of the data distribution. Next, we address the problem that the amount of randomization required to avoid privacy breaches (when mining association rules) results in very long transactions. By using pseudorandom generators and carefully choosing seeds such that the desired items from the original transaction are present in the randomized transaction, we can send just the seed instead of the transaction, resulting in a dramatic drop in communication and storage cost. Finally, we define new information measures that take privacy breaches into account when quantifying the amount of privacy preserved by randomization.
The widespread distribution and availability of small-scale sensors, actuators, and embedded processors is transforming the physical world into a computing platform. One such example is a sensor network consisting of a large number of sensor nodes that combine physical sensing capabilities such as temperature, light, or seismic sensors with networking and computation capabilities. Applications range from environmental control, warehouse inventory, and health care to military environments. Existing sensor networks assume that the sensors are preprogrammed and send data to a central frontend where the data is aggregated and stored for offline querying and analysis. This approach has two major drawbacks. First, the user cannot change the behavior of the system on the fly. Second, conservation of battery power is a major design factor, but a central system cannot make use of in-network programming, which trades costly communication for cheap local computation.In this paper, we introduce the Cougar approach to tasking sensor networks through declarative queries. Given a user query, a query optimizer generates an efficient query plan for in-network query processing, which can vastly reduce resource usage and thus extend the lifetime of a sensor network. In addition, since queries are asked in a declarative language, the user is shielded from the physical characteristics of the network. We give a short overview of sensor networks, propose a natural architecture for a data management system for sensor networks, and describe open research problems in this area.
We introduce a new algorithm for mining sequential patterns. Our algorithm is especially efficient when the sequential patterns in the database are very long. We introduce a novel depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms.Our implementation of the search strategy combines a vertical bitmap representation of the database with efficient support counting. A salient feature of our algorithm is that it incrementally outputs new frequent itemsets in an online fashion.In a thorough experimental evaluation of our algorithm on standard benchmark data from the literature, our algorithm outperforms previous work up to an order of magnitude.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.