In phase I of statistical process control (SPC), control charts are often used as outlier detection methods to assess process stability. Many of these methods require estimation of the covariance matrix, are computationally infeasible, or have not been studied when the dimension of the data, p, is large. We propose the one-class peeling (OCP) method, a flexible framework that combines statistical and machine learning methods to detect multiple outliers in multivariate data. The OCP method can be applied to phase I of SPC, does not require covariance estimation, and is well suited to high-dimensional data sets with a high percentage of outliers. Our empirical evaluation suggests that the OCP method performs well in high dimensions and is computationally more efficient and robust than existing methodologies. We motivate and illustrate the use of the OCP method in a phase I SPC application on a data set with N = 354 observations and p = 1917 dimensions containing Wikipedia search results for National Football League (NFL) players, teams, coaches, and managers. The example data set and the R functions OCP.R and OCPLimit.R, which compute the respective OCP distances and thresholds, are available in the supplementary materials.
The k-chart, based on support vector data description (SVDD), has received recent attention in the literature. We review four different methods for choosing the bandwidth parameter, s, when the k-chart is designed using the Gaussian kernel. We provide results of extensive Phase I and Phase II simulation studies varying the method of choosing the bandwidth parameter along with the size and distribution of sample data. The k-chart performed as desired only in very limited cases. In general, we are unable to recommend the k-chart in its current form for use in a Phase I or Phase II process monitoring study.
Inspired by support vector machines (SVM), SVDD is an unsupervised learning method used to give a description of (or produce a boundary around) a data set. Whereas SVM separates classes by maximizing the margin (the distance between the closest objects of two classes), SVDD minimizes the volume of the boundary surrounding a data set and relies on user-supplied parameters to determine how large the boundary should be. In SVM, the boundary between the two classes is defined by only a few points of each class, called the support vectors. Similarly, in SVDD, the boundary surrounding a data set is defined only by the points farthest from the center of the data, which are likewise referred to as support vectors. To obtain the SVDD hypersphere, defined by a center $a$ and a radius $R$, we minimize R using $F(R, a, \xi_i) = R^2 + C \sum_i \xi_i$
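As a rough illustration of the SVDD objective, the sketch below evaluates it for a fixed, hand-picked hypersphere. It assumes the standard formulation, in which each point gets a slack equal to the amount by which its squared distance to the center exceeds R^2, and the objective is R^2 plus C times the sum of the slacks. The Gaussian kernel is written as exp(-||x - y||^2 / s^2), which is one common parameterization of the bandwidth s and may differ from the one used in the paper. The function names and the toy data are hypothetical, and no optimization is performed.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def svdd_objective(points, center, radius, C):
    """Evaluate R^2 + C * sum(slack_i) for a given (not optimized) sphere.

    Assumption: slack_i = max(0, ||x_i - center||^2 - R^2).
    """
    slacks = [max(0.0, sq_dist(x, center) - radius ** 2) for x in points]
    return radius ** 2 + C * sum(slacks), slacks

def gaussian_kernel(x, y, s):
    """Gaussian kernel with bandwidth s (assumed form: exp(-d^2 / s^2))."""
    return math.exp(-sq_dist(x, y) / s ** 2)

# Toy data: three points on or inside the unit circle and one far outside it.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
value, slacks = svdd_objective(points, center=(0.0, 0.0), radius=1.0, C=0.5)
print(value, slacks)

# A smaller bandwidth makes the kernel similarity decay faster with distance.
print(gaussian_kernel(points[0], points[1], s=1.0))
print(gaussian_kernel(points[0], points[1], s=0.5))
```

Only the point outside the sphere contributes slack, so C controls how heavily that excursion is penalized relative to growing the radius.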
Boosting refers to methods that create a sequence of classifiers that perform at least slightly better than random (weak learners) and combine them into a highly accurate ensemble model (a strong learner) through weighted voting. There is sufficient empirical evidence to suggest that the performance of boosting methods is superior to that of individual classifiers. In the bias-variance decomposition framework, it has been demonstrated that boosting algorithms typically reduce bias for learning problems and, in some instances, also reduce variance. In addition, even when combining a large number of weak learners, boosting algorithms can be very robust to overfitting, in most instances having lower generalization error than competing ensemble methodologies such as bagging and random forests.
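The weighted-voting idea can be sketched with AdaBoost, one well-known boosting algorithm, using one-dimensional decision stumps as weak learners. This is a minimal illustration and not the specific method studied in the abstract; the function names and the toy data set are assumptions chosen for the example.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: predict `polarity` if x > threshold, else the opposite."""
    return polarity if x > threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Fit `rounds` stumps; returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Up-weight misclassified points, down-weight correct ones.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy labels that no single threshold can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
train_errors = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
print(train_errors)
```

On this toy set the best single stump still misclassifies three points, while the weighted vote of three stumps fits all ten training labels.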
Ensemble models refer to methods that combine a typically large number of classifiers into a compound prediction. The output of an ensemble method is the result of fitting a base-learning algorithm to a given data set and obtaining diverse answers by reweighting the observations or by resampling them according to a given probabilistic selection. A key challenge of using ensembles on large-scale multidimensional data lies in their complexity and the computational burden associated with them. The models created by ensembles are often difficult, if not impossible, to interpret, and their implementation requires more computational power than that of single classifiers. Recent research effort in the field has concentrated on reducing ensemble size while maintaining predictive accuracy. We propose a method to prune an ensemble solution by optimizing its margin distribution while increasing its diversity. The proposed algorithm results in an ensemble that uses only a fraction of the original classifiers, with improved or similar generalization performance. We analyze and test our method on both synthetic and real data sets. The simulations show that the proposed method compares favorably to the original ensemble solutions and to other existing ensemble pruning methodologies.
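To make the notion of an ensemble's margin distribution concrete, the sketch below computes normalized voting margins and applies a simple greedy pruning rule. The rule (drop the classifier whose removal leaves the largest minimum margin) is a hypothetical stand-in chosen for illustration; it is not the optimization proposed in the abstract, and it ignores the diversity criterion. All names and the toy votes are assumptions.

```python
def margins(votes, labels, weights):
    """Normalized margin per example: y_i * sum_t(w_t * h_t(x_i)) / sum_t(w_t).

    votes[t][i] is the {-1, +1} prediction of classifier t on example i.
    """
    total = sum(weights)
    return [y * sum(w * v[i] for w, v in zip(weights, votes)) / total
            for i, y in enumerate(labels)]

def min_margin(votes, labels, weights, members):
    """Smallest margin of the sub-ensemble made of the given member indices."""
    return min(margins([votes[j] for j in members], labels,
                       [weights[j] for j in members]))

def greedy_prune(votes, labels, weights, keep):
    """Hypothetical rule: repeatedly drop the classifier whose removal
    leaves the largest minimum margin, until `keep` classifiers remain."""
    active = list(range(len(votes)))
    while len(active) > keep:
        drop = max(active, key=lambda t: min_margin(
            votes, labels, weights, [j for j in active if j != t]))
        active.remove(drop)
    return active

# Four classifiers voting on four examples; the last one is always wrong.
labels = [1, 1, -1, -1]
votes = [
    [1, 1, -1, -1],
    [1, -1, -1, -1],
    [1, 1, 1, -1],
    [-1, -1, 1, 1],
]
weights = [1.0, 1.0, 1.0, 1.0]

kept = greedy_prune(votes, labels, weights, keep=3)
print(kept)
print(min(margins(votes, labels, weights)))
print(min_margin(votes, labels, weights, kept))
```

In this toy case the pruned ensemble is smaller and has a larger minimum margin than the full ensemble, which is the kind of trade-off that margin-based pruning aims for.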