Anomaly detection has been an important research topic in data mining and machine learning. Many real-world applications such as intrusion or credit card fraud detection require an effective and efficient framework to identify deviated data instances. However, most anomaly detection methods are typically implemented in batch mode, and thus cannot be easily extended to large-scale problems without sacrificing computation and memory requirements. In this paper, we propose an online oversampling principal component analysis (osPCA) algorithm to address this problem, and we aim at detecting the presence of outliers from a large amount of data via an online updating technique. Unlike prior principal component analysis (PCA)-based approaches, we do not store the entire data matrix or covariance matrix, and thus our approach is especially of interest in online or large-scale problems. By oversampling the target instance and extracting the principal direction of the data, the proposed osPCA allows us to determine the anomaly of the target instance according to the variation of the resulting dominant eigenvector. Since our osPCA need not perform eigen analysis explicitly, the proposed framework is favored for online applications which have computation or memory limitations. Compared with the well-known power method for PCA and other popular anomaly detection algorithms, our experimental results verify the feasibility of our proposed method in terms of both accuracy and efficiency.
In dealing with large data sets, the reduced support vector machine (RSVM) was proposed for the practical objective to overcome some computational difficulties as well as to reduce the model complexity. In this paper, we study the RSVM from the viewpoint of sampling design, its robustness, and the spectral analysis of the reduced kernel. We consider the nonlinear separating surface as a mixture of kernels. Instead of a full model, the RSVM uses a reduced mixture with kernels sampled from certain candidate set. Our main results center on two major themes. One is the robustness of the random subset mixture model. The other is the spectral analysis of the reduced kernel. The robustness is judged by a few criteria as follows: 1) model variation measure; 2) model bias (deviation) between the reduced model and the full model; and 3) test power in distinguishing the reduced model from the full one. For the spectral analysis, we compare the eigenstructures of the full kernel matrix and the approximation kernel matrix. The approximation kernels are generated by uniform random subsets. The small discrepancies between them indicate that the approximation kernels can retain most of the relevant information for learning tasks in the full kernel. We focus on some statistical theory of the reduced set method mainly in the context of the RSVM. The use of a uniform random subset is not limited to the RSVM. This approach can act as a supplemental algorithm on top of a basic optimization algorithm, wherein the actual optimization takes place on the subset-approximated data. The statistical properties discussed in this paper are still valid.
In the era of Basel II a powerful tool for bankruptcy prognosis is vital for banks. The tool must be precise but also easily adaptable to the bank's objectives regarding the relation of false acceptances (Type I error) and false rejections (Type II error). We explore the suitability of smooth support vector machines (SSVM), and investigate how important factors such as the selection of appropriate accounting ratios (predictors), length of training period and structure of the training sample influence the precision of prediction. Moreover, we show that oversampling can be employed to control the trade-off between error types, and we compare SSVM with both logistic and discriminant analysis. Finally, we illustrate graphically how different models can be used jointly to support the decision-making process of loan officers. Copyright © 2008 John Wiley & Sons, Ltd.
This paper presents a novel sparse representation for robust face recognition. We advance both group sparsity and data locality and formulate a unified optimization framework, which produces a locality and group sensitive sparse representation (LGSR) for improved recognition. Empirical results confirm that our LGSR not only outperforms state-of-the-art sparse coding based image classification methods, our approach is robust to variations such as lighting, pose, and facial details (glasses or not), which are typically seen in real-world face recognition problems.
We introduced a three-tier architecture of intrusion detection system which consists of a blacklist, a whitelist and a multi-class support vector machine classifier. The first tier is the blacklist that will filter out the known attacks from the traffic and the whitelist identifies the normal traffics. The rest traffics, the anomalies detected by the whitelist, were then be classified by a multi-class SVM classifier into four categories: PROBE, DoS, R2L and U2R. Many data mining and machine learning techniques were applied here. We design this three-tier IDS based on the KDD'99 benchmark dataset. Our system has 94.71% intrusion detection rate and 93.52% diagnosis rate. The average cost for each connection is 0.1781. All of these results are better than those of KDD'99 winner's. Our three-tier architecture design also provides the flexibility for the practical usage. The network system administrator can add the new patterns into the blacklist and allows to do fine tuning of the whitelist according to the environment of their network system and security policy.
Along with the growth of cloud computing and mobile devices, the importance of client device identity concern over cloud environment is emerging. To provide a lightweight yet reliable method for device identification, an application layer approach based on clock skew fingerprint is proposed. The developed experimental platform adapts AJAX technology to collect the timestamps of client devices in the cloud server during connection time, then calculate the clock skews of client devices. Few methods based on linear regression and piecewise minimum algorithm are developed to optimize the precision and shorten timestamp collection process. A jump point detection scheme is also proposed to resolve the offset drifting problem, which is usually caused by switching network or temporary disconnection. Finally, two experiments are conducted to study the effectiveness of clock skew fingerprint, and the results illustrate that the false positive rate and the false negative rate, in the worst case, are both no more than 8% when the tolerance threshold is set appropriately.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.