ABSTRACT The intrusion detection system (IDS) is one of the most extensively used techniques in a network topology to safeguard the integrity and availability of sensitive assets in the protected systems. Although many supervised and unsupervised learning approaches from the field of machine learning have been used to increase the efficacy of IDSs, it remains difficult for existing intrusion detection algorithms to achieve good performance. First, redundant and irrelevant data in high-dimensional datasets interfere with the classification process of an IDS. Second, an individual classifier may not perform well in detecting each type of attack. Third, many models are built for stale datasets, making them less adaptable to novel attacks. We therefore propose a new intrusion detection framework based on feature selection and ensemble learning techniques. In the first step, a heuristic algorithm called CFS-BA is proposed for dimensionality reduction, which selects the optimal feature subset based on the correlation between features. We then introduce an ensemble approach that combines the C4.5, Random Forest (RF), and Forest by Penalizing Attributes (Forest PA) algorithms. Finally, a voting technique is used to combine the probability distributions of the base learners for attack recognition. Experimental results on the NSL-KDD, AWID, and CIC-IDS2017 datasets reveal that the proposed CFS-BA-Ensemble method exhibits better performance than other related and state-of-the-art approaches under several metrics.
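The feature-selection-then-soft-voting pipeline described above can be sketched with scikit-learn. This is an illustrative stand-in, not the paper's implementation: `DecisionTreeClassifier` substitutes for C4.5 and `ExtraTreesClassifier` for Forest PA (neither has a scikit-learn implementation), mutual-information ranking substitutes for CFS-BA, and synthetic data replaces the NSL-KDD/AWID/CIC-IDS2017 datasets.

```python
# Sketch of the feature-selection + soft-voting ensemble (stand-in components;
# see the lead-in for which pieces are assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: feature selection (top-k by mutual information, a CFS-BA stand-in).
# Step 2: heterogeneous base learners.
# Step 3: voting="soft" averages the base learners' class-probability
#         distributions, as in the abstract's final voting stage.
ensemble = make_pipeline(
    SelectKBest(mutual_info_classif, k=10),
    VotingClassifier(
        estimators=[
            ("c45", DecisionTreeClassifier(criterion="entropy", random_state=0)),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
            ("fpa", ExtraTreesClassifier(n_estimators=100, random_state=0)),
        ],
        voting="soft",
    ),
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
```

On real IDS data the pipeline would be fitted per attack class and evaluated with per-class recall as well as overall accuracy.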
This paper proposes a bootstrap-assisted procedure to conduct simultaneous inference for high dimensional sparse linear models based on the recent de-sparsifying Lasso estimator (van de Geer et al. 2014). Our procedure allows the dimension of the parameter vector of interest to be exponentially larger than the sample size, and it automatically accounts for the dependence within the de-sparsifying Lasso estimator. Moreover, our simultaneous testing method can be naturally coupled with margin screening (Fan and Lv 2008) to enhance its power in sparse testing with a reduced computational cost, or with the step-down method (Romano and Wolf 2005) to provide strong control of the family-wise error rate. In theory, we prove that our simultaneous testing procedure asymptotically achieves the pre-specified significance level, and enjoys certain optimality in terms of its power even when the model errors are non-Gaussian. Our general theory is also useful in studying the support recovery problem. To broaden the applicability, we further extend our main results to generalized linear models with convex loss functions.
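The core bootstrap idea can be illustrated in a toy setting: approximate the distribution of the maximum coordinate of a centered estimator via Gaussian multipliers, then compare each coordinate against the resulting simultaneous critical value. This sketch uses the plain sample mean in place of the de-sparsified Lasso estimator so the code stays self-contained; all dimensions and the multiplier scheme are illustrative assumptions, not the paper's exact procedure.

```python
# Multiplier-bootstrap sketch for simultaneous testing of many coordinates
# (sample mean as a stand-in for the de-sparsified Lasso estimator).
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))      # n observations of a p-dim score
theta_hat = X.mean(axis=0)           # coordinate-wise estimator

# Bootstrap the max-statistic: reweight centered rows with i.i.d. Gaussian
# multipliers to mimic the law of max_j sqrt(n) |theta_hat_j - theta_j|.
B = 500
centered = X - theta_hat
boot_max = np.empty(B)
for b in range(B):
    e = rng.standard_normal(n)
    boot_max[b] = np.abs(centered.T @ e).max() / np.sqrt(n)

crit = np.quantile(boot_max, 0.95)               # simultaneous 95% critical value
reject = np.sqrt(n) * np.abs(theta_hat) > crit   # tests H0: theta_j = 0 for all j
```

Because a single critical value bounds the maximum over all p coordinates, the resulting test controls the family-wise error rate simultaneously rather than per coordinate; coupling it with screening or a step-down refinement, as in the paper, sharpens power.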
The effectiveness of our methods is demonstrated via simulation studies.

[Table: empirical coverage probabilities (Cov) and average interval lengths (Len) of 95% and 99% intervals for the NST cv, ST cv, and EX cv methods (NA where a method is unavailable), under t(4)/√2 and Gamma error distributions, for p = 120 and p = 500. Note: the tuning parameters λ_j in the nodewise Lasso are chosen to be the same via 10-fold cross-validation among all nodewise regressions for NST cv, ST cv, and EX cv; t(4)/√2 and Gamma denote the studentized t(4) dis…]
Summary We propose a novel sparse tensor decomposition method, namely the tensor truncated power method, that incorporates variable selection in the estimation of decomposition components. The sparsity is achieved via an efficient truncation step embedded in the tensor power iteration. Our method applies to a broad family of high dimensional latent variable models, including high dimensional Gaussian mixtures and mixtures of sparse regressions. A thorough theoretical investigation is further conducted. In particular, we show that the final decomposition estimator is guaranteed to achieve a local statistical rate, and we further strengthen it to the global statistical rate by introducing a proper initialization procedure. In high dimensional regimes, the statistical rate obtained significantly improves those shown in the existing non-sparse decomposition methods. The empirical advantages of the tensor truncated power method are confirmed in extensive simulation results and two real applications of click-through rate prediction and high dimensional gene clustering.
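The "truncation step embedded in the tensor power iteration" can be sketched for a rank-1 symmetric 3-way tensor: each power step is followed by hard-thresholding to the s largest-magnitude entries. The planted-factor setup, the noise level, the truncation level s, and the spectral initialization (truncated leading singular vector of the mode-1 unfolding) are illustrative assumptions, not the paper's full multi-component algorithm.

```python
# Sparse truncated power iteration on a noisy rank-1 symmetric tensor.
import numpy as np

def truncate(v, s):
    """Keep the s largest-magnitude entries of v, zero the rest, renormalize."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]
    out[idx] = v[idx]
    return out / np.linalg.norm(out)

rng = np.random.default_rng(0)
p, s = 50, 5
u_true = truncate(rng.standard_normal(p), s)             # planted sparse factor
T = 4.0 * np.einsum("i,j,k->ijk", u_true, u_true, u_true)
T += 0.001 * rng.standard_normal((p, p, p))              # entrywise noise

# Spectral initialization: truncated leading left singular vector of the
# mode-1 unfolding (a stand-in for the paper's initialization procedure).
u = truncate(np.linalg.svd(T.reshape(p, p * p),
                           full_matrices=False)[0][:, 0], s)

for _ in range(30):
    v = np.einsum("ijk,j,k->i", T, u, u)   # tensor power step T(I, u, u)
    u = truncate(v, s)                     # truncation enforces sparsity

recovery = abs(u @ u_true)                 # near 1 when the factor is recovered
```

The truncation keeps every iterate exactly s-sparse, which is what yields the improved high-dimensional rates relative to non-sparse power iterations.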
This article studies local and global inference for smoothing spline estimation in a unified asymptotic framework. We first introduce a new technical tool called the functional Bahadur representation, which significantly generalizes the traditional Bahadur representation in parametric models [Bahadur, Ann. Math. Statist. 37 (1966) 577-580]. Equipped with this tool, we develop four interconnected procedures for inference: (i) pointwise confidence intervals; (ii) local likelihood ratio testing; (iii) simultaneous confidence bands; (iv) global likelihood ratio testing. In particular, our confidence intervals are proved to be asymptotically valid at any point in the support, and they are shorter on average than the Bayesian confidence intervals proposed by Wahba [J. R. Stat. Soc. Ser. B Stat. Methodol. 45 (1983) 133-150] and Nychka [J. Amer. Statist. Assoc. 83 (1988) 1134-1143]. We also discuss a version of the Wilks phenomenon arising from local/global likelihood ratio testing. It is also worth noting that our simultaneous confidence bands are the first applicable to general quasi-likelihood models. Furthermore, issues relating to optimality and efficiency are carefully addressed. As a by-product, we discover a surprising relationship between periodic and nonperiodic smoothing splines in terms of inference. (Published in the Annals of Statistics, http://dx.doi.org/10.1214/13-AOS1164, by the Institute of Mathematical Statistics.)
Partially linear models provide a useful class of tools for modeling complex data by naturally incorporating a combination of linear and nonlinear effects within one framework. One key question in partially linear models is the choice of model structure, that is, how to decide which covariates are linear and which are nonlinear. This is a fundamental, yet largely unsolved problem for partially linear models. In practice, one often assumes that the model structure is given or known and then makes estimation and inference based on that structure. Alternatively, there are two methods in common use for tackling the problem: hypothesis testing and visual screening based on the marginal fits. Both methods are quite useful in practice but have their drawbacks. First, it is difficult to construct a powerful procedure for testing multiple hypotheses of linear against nonlinear fits. Second, the screening procedure based on the scatterplots of individual covariate fits may provide an educated guess on the regression function form, but the procedure is ad hoc and lacks theoretical justification. In this article, we propose a new approach to structure selection for partially linear models, called the LAND (Linear And Nonlinear Discoverer). The procedure is developed in an elegant mathematical framework and possesses desired theoretical and computational properties. Under certain regularity conditions, we show that the LAND estimator is able to identify the underlying true model structure correctly and at the same time estimate the multivariate regression function consistently. The convergence rate of the new estimator is established as well. We further propose an iterative algorithm to implement the procedure and illustrate its performance by simulated and real examples. Supplementary materials for this article are available online.
The number of studies focused on the transformation and sequestration of soil organic carbon (C) has dramatically increased in recent years due to growing interest in understanding the global C cycle. While it is readily accepted that terrestrial C dynamics are heavily influenced by the catabolic and anabolic activities of microorganisms, the incorporation of microbial biomass components into stable soil C pools (via microbial living cells and necromass) has received less attention. Nevertheless, microbial-derived C inputs to soils are now increasingly recognized as playing a far greater role in the stabilization of soil organic matter than previously believed. Our understanding, however, is limited by the difficulties associated with studying microbial turnover in soils. Here, we describe the use of an Absorbing Markov Chain (AMC) to model the dynamics of soil C transformations among three microbial states: living microbial biomass, microbial necromass, and C removed from living and dead microbial sources. We find that the AMC provides a powerful quantitative approach that allows prediction of how C will be distributed among these three states, and how long it will take for the entire amount of initial C to pass through the biomass and necromass pools and be moved into the atmosphere. Further, assuming constant C inputs to the model, we can predict how C is eventually distributed, along with how much C sequestered in soil is microbial-derived. Our work represents a first step in attempting to quantify the flow of C through microbial pathways, and has the potential to increase our understanding of the microbial role in soil C dynamics.
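The three-state absorbing-chain mechanics described above can be sketched numerically: the fundamental matrix N = (I − Q)⁻¹ gives expected residence times in the transient pools, and powers of the transition matrix give the distribution of initial C across states over time. The transition probabilities below are made-up illustrative values, not fitted soil turnover rates.

```python
# Toy absorbing Markov chain for the three-pool soil-C model.
import numpy as np

# States: 0 = living biomass, 1 = necromass, 2 = removed C (absorbing).
P = np.array([
    [0.70, 0.25, 0.05],   # biomass  -> biomass / necromass / removed
    [0.00, 0.80, 0.20],   # necromass -> necromass / removed
    [0.00, 0.00, 1.00],   # removed C is absorbing
])

Q = P[:2, :2]                         # transitions among transient states
N = np.linalg.inv(np.eye(2) - Q)      # fundamental matrix: expected visits
steps_to_removal = N.sum(axis=1)      # expected steps before C is removed

# Distribution of a unit of initial biomass C across the states after t steps.
t = 10
dist = np.linalg.matrix_power(P, t)[0]
```

With these illustrative rates, C starting in necromass leaves after 1/0.2 = 5 steps on average, while C starting as biomass takes 7.5 steps because it first cycles through both transient pools; this is exactly the "how long until all initial C is removed" quantity the abstract refers to.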
Consider M-estimation in a semiparametric model that is characterized by a Euclidean parameter of interest and an infinite-dimensional nuisance parameter. As a general purpose approach to statistical inference, the bootstrap has found wide applications in semiparametric M-estimation and, because of its simplicity, provides an attractive alternative to the inference approach based on asymptotic distribution theory. The purpose of this paper is to provide theoretical justifications for the use of the bootstrap as a semiparametric inferential tool. We show that, under general conditions, the bootstrap is asymptotically consistent in estimating the distribution of the M-estimate of the Euclidean parameter; that is, the bootstrap distribution asymptotically imitates the distribution of the M-estimate. We also show that the bootstrap confidence set has the asymptotically correct coverage probability. These general conclusions hold, in particular, when the nuisance parameter is not estimable at root-n rate, and apply to a broad class of bootstrap methods with exchangeable bootstrap weights. This paper provides a first general theoretical study of the bootstrap in semiparametric models.
We propose a roughness regularization approach to making nonparametric inference for generalized functional linear models. In a reproducing kernel Hilbert space framework, we construct asymptotically valid confidence intervals for the regression mean, prediction intervals for future responses and various statistical procedures for hypothesis testing. In particular, one procedure for testing global behaviors of the slope function is adaptive to the smoothness of the slope function and to the structure of the predictors. As a by-product, a new type of Wilks phenomenon [Ann. Math. Stat. 9 (1938) 60-62; Ann. Statist. 29 (2001) 153-193] is discovered when testing the functional linear models. Despite the generality, our inference procedures are easy to implement. Numerical examples are provided to demonstrate the empirical advantages over the competing methods. A collection of technical tools such as integro-differential equation techniques [Trans. Amer. Math. Soc. 29 (1927) 755-800; Trans. Amer. Math. Soc. 30 (1928) 453-471; Trans. Amer. Math. Soc. 32 (1930) 860-868], Stein's method [Ann. Statist. 41 (2013) 2786-2819; Stein, Approximate Computation of Expectations (1986) IMS] and the functional Bahadur representation [Ann. Statist. 41 (2013) 2608-2638] are employed in this paper. (Published in the Annals of Statistics, http://dx.doi.org/10.1214/15-AOS1322, by the Institute of Mathematical Statistics.)