We present a new approach for dealing with distribution change and concept drift when learning from data sequences that may vary over time. We use sliding windows whose size, instead of being fixed a priori, is recomputed online according to the rate of change observed in the data in the window itself. This relieves the user or programmer of having to guess a time-scale for change. Contrary to many related works, we provide rigorous guarantees of performance, in the form of bounds on the rates of false positives and false negatives. Using ideas from data stream algorithmics, we develop a time- and memory-efficient version of this algorithm, called ADWIN2. We show how to combine ADWIN2 with the Naïve Bayes (NB) predictor in two ways: first, using it to monitor the error rate of the current model and declare when revision is necessary, and second, putting it inside the NB predictor to maintain up-to-date estimates of the conditional probabilities in the data. We test our approach on synthetic and real data streams and compare it to both fixed-size and variable-size window strategies, with good results.
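To make the windowing idea concrete, below is a minimal sketch of an adaptive sliding window in the spirit described above. It is not the paper's ADWIN2 data structure (which uses exponential histograms for time and memory efficiency): this version performs a simplified O(W)-per-update check, and the class and parameter names are illustrative. Whenever the means of two sub-windows of the current window differ by more than a Hoeffding-style threshold, the older part is dropped.

```python
import math
from collections import deque

class AdaptiveWindow:
    """Simplified adaptive window sketch; not the paper's ADWIN2 structure."""

    def __init__(self, delta=0.002):
        self.delta = delta        # confidence parameter of the cut test
        self.window = deque()     # recent observations, e.g. 0/1 prediction errors

    def _cut_threshold(self, n0, n1):
        # Hoeffding-style bound on the allowed difference between sub-window means.
        m = 1.0 / (1.0 / n0 + 1.0 / n1)   # harmonic mean of the sub-window sizes
        return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / self.delta))

    def add(self, x):
        """Insert one observation; shrink the window while any split point shows
        significantly different means. Returns True if old data was dropped."""
        self.window.append(x)
        changed = False
        shrinking = True
        while shrinking and len(self.window) >= 2:
            shrinking = False
            items = list(self.window)
            total, n = sum(items), len(items)
            head = 0.0
            for i in range(1, n):
                head += items[i - 1]
                n0, n1 = i, n - i
                if abs(head / n0 - (total - head) / n1) > self._cut_threshold(n0, n1):
                    self.window.popleft()    # drop the oldest element and re-check
                    changed = shrinking = True
                    break
        return changed

    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0
```

Feeding in 1 when the current predictor errs and 0 otherwise lets the window both estimate the current error rate and signal when model revision is needed, as in the first combination with Naïve Bayes described above.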
Advanced analysis of data streams is quickly becoming a key area of data mining research as the number of applications demanding such processing increases. Online mining when such data streams evolve over time, that is, when concepts drift or change completely, is becoming one of the core issues. When tackling non-stationary concepts, ensembles of classifiers have several advantages over single-classifier methods: they are easy to scale and parallelize, they can adapt to change quickly by pruning under-performing parts of the ensemble, and as a result they usually generate more accurate concept descriptions. This paper proposes a new experimental data stream framework for studying concept drift, and two new variants of Bagging: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging. Using the new experimental framework, an evaluation study on synthetic and real-world datasets comprising up to ten million examples shows that the new ensemble methods perform very well compared to several known methods.
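As a rough illustration of the ADWIN Bagging idea (not the paper's implementation), the sketch below combines online bagging with Poisson(1) example weights and one change detector per ensemble member; when a member's detector flags drift on that member's error stream, the member is reset. The `make_model` and `make_detector` factories and their `learn`/`predict`/`add` interfaces are assumptions for illustration; in the paper's variant, detection is instead used to replace the worst-performing classifier in the ensemble.

```python
import random

class AdwinBaggingSketch:
    """Illustrative sketch of ADWIN-style bagging with assumed interfaces."""

    def __init__(self, make_model, make_detector, n_members=10, seed=0):
        self.make_model = make_model        # () -> model with learn(x, y, weight) / predict(x)
        self.make_detector = make_detector  # () -> detector with add(error) -> bool
        self.members = [make_model() for _ in range(n_members)]
        self.detectors = [make_detector() for _ in range(n_members)]
        self.rng = random.Random(seed)

    def _poisson1(self):
        # Knuth's method for Poisson(lambda=1), the example weight in online bagging.
        l, k, p = 2.718281828 ** -1, 0, 1.0
        while True:
            k += 1
            p *= self.rng.random()
            if p <= l:
                return k - 1

    def learn_one(self, x, y):
        for i, (model, det) in enumerate(zip(self.members, self.detectors)):
            erred = 1.0 if model.predict(x) != y else 0.0
            if det.add(erred):                    # drift flagged on this member's error
                self.members[i] = self.make_model()
                self.detectors[i] = self.make_detector()
            w = self._poisson1()
            if w > 0:
                self.members[i].learn(x, y, weight=w)

    def predict_one(self, x):
        votes = {}
        for model in self.members:
            label = model.predict(x)
            votes[label] = votes.get(label, 0) + 1
        return max(votes, key=votes.get)
```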
We show that the class of all circuits is exactly learnable in randomized expected polynomial time using weak subset and weak superset queries. This is a consequence of the following result, which we consider to be of independent interest: circuits are exactly learnable in randomized expected polynomial time with equivalence queries and the aid of an NP-oracle. We also show that circuits are exactly learnable in deterministic polynomial time with equivalence queries and a Σ₃ᵖ-oracle. The hypothesis class for the above learning algorithms is the class of circuits of larger but polynomially related size. The algorithms can also be adapted to learn the class of DNF formulas with a hypothesis class consisting of depth-3 ∧-∨-∧ formulas (by the work of Angluin this is optimal, in the sense that the hypothesis class cannot be reduced to DNF formulas, i.e., depth-2 ∨-∧ formulas). We also investigate the power of an NP-oracle in the context of learning with membership queries. We show that there are deterministic learning algorithms that use membership queries and an NP-oracle to learn: monotone Boolean functions in time polynomial in the DNF size and CNF size of the target formula; and the class of O(log n)-DNF & O(log n)-CNF formulas in time polynomial in n. We also show that, with an NP-oracle and membership queries, there is a randomized expected polynomial time algorithm that learns any class that is learnable from membership queries with unlimited computational power. Using similar techniques, we show the following, both for membership and for equivalence queries (when the hypotheses allowed are precisely the concepts in the class): any class learnable with unbounded computational power is learnable in deterministic polynomial time with a Σ₅ᵖ-oracle. Furthermore, we identify the combinatorial properties that completely determine learnability in this information-theoretic sense. Finally, we point out a consequence of our result in structural complexity theory: if every NP set has polynomial-size circuits, then the polynomial hierarchy collapses to ZPP^NP.
As energy-related costs have become a major economic factor for IT infrastructures and data centers, companies and the research community are being challenged to find better and more efficient power-aware resource management strategies. There is growing interest in "Green" IT, and a large gap in this area remains to be covered. To obtain an energy-efficient data center, we propose a framework that provides an intelligent consolidation methodology using several techniques, such as turning machines on and off, power-aware consolidation algorithms, and machine learning techniques to deal with uncertain information while maximizing performance. For the machine learning approach, we use models learned from previous system behavior to predict power consumption levels, CPU loads, and SLA timings, and to improve scheduling decisions. Our framework is vertical, because it spans from watt-level power consumption up to workload features, and cross-disciplinary, as it uses a wide variety of techniques. We evaluate these techniques with a framework that covers the whole control cycle of a real scenario, using a simulation with representative heterogeneous workloads, and we measure the quality of the results according to a set of metrics oriented toward our goals, in addition to traditional policies. The results obtained indicate that our approach is close to the optimal placement and behaves better when the level of uncertainty increases.
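As one small, hypothetical illustration of how a learned load predictor could feed a power-aware consolidation decision (this is not the paper's framework; `predict_load`, the capacity ceiling, and the first-fit-decreasing heuristic are all assumptions made for the example):

```python
def consolidate(tasks, hosts, predict_load, capacity=0.9):
    """Sketch: first-fit-decreasing consolidation driven by a learned load predictor.

    tasks        : list of task descriptors understood by predict_load
    hosts        : list of host identifiers
    predict_load : assumed callable, task -> predicted CPU fraction in [0, 1]
    capacity     : target utilization ceiling per host
    """
    placement = {h: [] for h in hosts}
    used = {h: 0.0 for h in hosts}
    # Pack the largest predicted loads first to reduce the number of active hosts.
    for task in sorted(tasks, key=predict_load, reverse=True):
        load = predict_load(task)
        for h in hosts:
            if used[h] + load <= capacity:
                placement[h].append(task)
                used[h] += load
                break
        else:
            raise RuntimeError("no host can accommodate the predicted load")
    idle_hosts = [h for h in hosts if not placement[h]]   # candidates to power off
    return placement, idle_hosts
```

Hosts left empty by the packing are the candidates for being switched off, which is where the energy savings would come from in such a scheme.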
Scalability is a key requirement for any KDD and data mining algorithm, and one of the biggest research challenges is to develop methods that can handle large amounts of data. One possible approach for dealing with huge amounts of data is to take a random sample and do data mining on it, since for many data mining applications approximate answers are acceptable. However, as argued by several researchers, random sampling is difficult to use because of the difficulty of determining an appropriate sample size. In this paper, we take a sequential sampling approach to this difficulty and propose an adaptive sampling algorithm that solves a general problem covering many problems arising in applications of discovery science. The algorithm obtains examples sequentially, in an online fashion, and determines from the obtained examples whether it has already seen a large enough number of them. Thus, the sample size is not fixed a priori; instead, it adapts to the situation. Because of this adaptiveness, if we are not in a worst-case situation, as fortunately happens in many practical applications, then we can solve the problem with a number of examples much smaller than required in the worst case. To illustrate the generality of our approach, we also describe how different instantiations of it can be applied to scale up knowledge discovery problems that appear in several areas.
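The following is a minimal sketch of the sequential-sampling idea under stated assumptions (a bounded-valued source, a threshold-decision task, a Hoeffding-style confidence radius, and a per-step confidence budget; the paper's algorithm is more general). Examples are drawn one at a time, and sampling stops as soon as the running estimate is confidently separated from the threshold, so the number of examples used adapts to the data actually seen.

```python
import math

def adaptive_threshold_test(draw, theta, delta=0.05, epsilon=0.01, max_n=10_000_000):
    """Sketch of adaptive sequential sampling: decide whether the mean of a
    [0, 1]-valued source exceeds `theta`, stopping as early as the data allow.

    draw    : assumed callable returning one observation in [0, 1] per call
    theta   : decision threshold for the unknown mean
    delta   : overall error probability, split across the sequential checks
    epsilon : indifference margin that bounds the worst-case sample size
    """
    total, n = 0.0, 0
    while n < max_n:
        total += draw()
        n += 1
        # Union bound over steps: spend delta / (n * (n + 1)) at step n (sums to delta).
        step_delta = delta / (n * (n + 1))
        radius = math.sqrt(math.log(2.0 / step_delta) / (2.0 * n))
        estimate = total / n
        if estimate - radius > theta - epsilon:
            return True, n      # confidently above the threshold
        if estimate + radius < theta + epsilon:
            return False, n     # confidently below the threshold
    return (total / n) >= theta, n   # fall back after the worst-case budget
```

When the true mean is far from `theta`, the confidence interval separates from the threshold after only a few examples; a sample size near the worst case is paid only when the mean sits close to the threshold.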
The growing complexity of software systems is resulting in an increasing number of software faults. According to the literature, software faults are becoming one of the main sources of unplanned system outages and have an important impact on company benefits and image. For this reason, many techniques (such as clustering, fail-over techniques, or server redundancy) have been proposed to avoid software failures, and yet failures still happen. Many software failures are due to software aging phenomena. In this work, we present a detailed evaluation of our chosen machine learning prediction algorithm (M5P) against dynamic and non-deterministic software aging. We have tested our prediction model on a three-tier web J2EE application, achieving acceptable prediction accuracy in complex scenarios with small training data sets. Furthermore, we have found an interesting aid for determining the root cause of a failure: the model generated by the machine learning algorithm itself.
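As a hedged, illustrative stand-in for the approach just described (the work uses the M5P model-tree learner from Weka; here a scikit-learn regression tree takes its place, and the feature names and synthetic data are invented for the example), a resource-metrics-to-time-to-failure regressor might look like:

```python
# Stand-in for M5P: a regression tree trained on monitored resource metrics
# to predict the remaining time until resource exhaustion / crash.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(0, 100, n),    # e.g., used heap memory (MB)      -- assumed feature
    rng.uniform(0, 1, n),      # e.g., CPU utilization            -- assumed feature
    rng.uniform(0, 5000, n),   # e.g., open connections / threads -- assumed feature
])
# Synthetic target: remaining minutes until crash, shrinking as memory pressure grows.
y = np.maximum(0.0, 300.0 - 2.5 * X[:, 0] - 20.0 * X[:, 1] + rng.normal(0, 5, n))

model = DecisionTreeRegressor(max_depth=5).fit(X[:400], y[:400])
pred = model.predict(X[400:])
print("predicted minutes to failure (first 5):", np.round(pred[:5], 1))
```

Inspecting the learned tree's split conditions is what, in the spirit of the paper, can hint at which monitored metric is driving the predicted failure.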