Online anomaly detection is an important step in data center management, requiring light-weight techniques that provide sufficient accuracy for subsequent diagnosis and management actions. This paper presents statistical techniques based on the Tukey and Relative Entropy statistics, and applies them to data collected from a production environment and to data captured from a testbed for multi-tier web applications running on server class machines. The proposed techniques are lightweight and improve over standard Gaussian assumptions in terms of performance.
No abstract
The class imbalance problem is pervasive in machine learning. To accurately classify the minority class, current methods rely on sampling schemes to close the gap between classes, or on the application of error costs to create algorithms which favor the minority class. Since the sampling schemes and costs must be specified, these methods are highly dependent on the class distributions present in the training set. This makes them difficult to apply in settings where the level of imbalance changes, such as in online streaming data. Often they cannot handle multi-class problems. We present a novel single-class algorithm called Class Conditional Nearest Neighbor Distribution (CCNND), which mitigates the effects of class imbalance through local geometric structure in the data. Our algorithm can be applied seamlessly to problems with any level of imbalance or number of classes, and new examples are simply added to the training set. We show that it performs as well as or better than top sampling and costweighting methods on four imbalanced datasets from the UCI Machine Learning Repository, and then apply it to streaming data from the oil and gas industry alongside a modified nearest neighbor algorithm. Our algorithm's competitive performance relative to the state-of-the-art, coupled with its extremely simple implementation and automatic adjustment for minority classes, demonstrates that it is worth further study.
As the website is a primary customer touch-point, millions are spent to gather web data about customer visits. Sadly, the trove of data and corresponding analytics have not lived up to the promise. Current marketing practice relies on ambiguous summary statistics or small-sample usability studies. Idiosyncratic browsing and low conversion (browser-tobuyer) make modeling hard. In this paper, we model browsing patterns (sequence of clicks) via Markov chain theory to predict users' propensity to buy within a session. We focus on model complexity, imputing missing values, data augmentation, and other attendant issues that impact performance. The paper addresses the following aspects; (1) Determine appropriate order of the Markov chain (assess the influence of prior history in prediction), (2) Impute missing transitions by exploiting the inherent link structure in the page sequences, (3) predict the likelihood of a purchase based on variable-length page sequences, and (4) Augment the training set of buyers (which is typically very small: ∼ 2%) by viewing the page transitions as a graph and exploiting its link structure to improve performance. The cocktail of solutions address important issues in practical digital marketing. Extensive analysis of data applied to a large commercial web-site shows that Markov chain based classifiers are useful predictors of user intent.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.