Two-Stage Classification with SIS Using a New Filter Ranking Method in High Throughput Data

Kim, Sangjin; Kim, Jong Min

doi:10.3390/math7060493

Cited by 7 publications

(8 citation statements)

References 52 publications

(56 reference statements)

“…With this setup of high-dimensional data, we simulated three different types of data, each with correlation structures ρ = 0.2, 0.5, and 0.8 respectively. These values show the low, intermediate, and high correlation structures in the datasets which are significantly similar to what we usually see in the gene expression or others among many types of data in the field of bioinformatics [13,52]. At first, the data were divided randomly into training and testing sets with 75% and 25% of samples respectively; 75% of the training data was given to the FS methods, which ranked the genes concerning their importance, and then the top-ranked genes were selected based on b-SIS condition.…”

Section: Simulation Data Setup (supporting)

confidence: 76%
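The simulation setup quoted above can be sketched as follows. This is a minimal sketch, not the cited authors' code. It assumes an equicorrelated (compound-symmetry) Gaussian design built from one shared latent factor, because the excerpt gives only the values of ρ and not the covariance structure. It generates features only and no class labels. The names `simulate_equicorrelated`, `split_train_test`, and `pearson` are hypothetical.

```python
import math
import random


def simulate_equicorrelated(n, p, rho, seed=0):
    """Draw n samples of p standard-normal features with pairwise correlation rho.

    Assumption (not stated in the excerpt): compound-symmetry structure,
    x_j = sqrt(rho) * z_shared + sqrt(1 - rho) * z_j.
    """
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    rows = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        rows.append([a * shared + b * rng.gauss(0.0, 1.0) for _ in range(p)])
    return rows


def split_train_test(rows, train_frac=0.75, seed=0):
    """Randomly split the samples into training and testing sets."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(round(train_frac * len(rows)))
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = math.sqrt(sum((x - mu) ** 2 for x in u))
    sv = math.sqrt(sum((y - mv) ** 2 for y in v))
    return cov / (su * sv)


if __name__ == "__main__":
    # Low, intermediate, and high correlation, as in the quoted setup.
    for rho in (0.2, 0.5, 0.8):
        X = simulate_equicorrelated(n=400, p=5, rho=rho, seed=1)
        train, test = split_train_test(X, train_frac=0.75, seed=1)
        r = pearson([row[0] for row in X], [row[1] for row in X])
        print(rho, len(train), len(test), round(r, 2))
```

The shared-factor construction keeps the sketch dependency-free. A different structure, such as autoregressive correlation, would need a different generator.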

“…From [13], we see that the resampling-based FS is relatively more efficient in comparison to the other existing FS methods in gene expression data. The RLFS method is based on the lasso penalized regression method and the resampling approach employed to obtain the ranked important features using the frequency.…”

Section: The Resampling-based Lasso Feature Selection (mentioning)

confidence: 93%
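The idea of ranking features by how often a lasso fit selects them across resamples can be illustrated with a short sketch. It is not the RLFS implementation from the cited paper. As simplifying assumptions, it swaps in a linear-regression lasso solved by plain coordinate descent (the classification setting would normally use a penalized logistic model), draws half-sized subsamples without replacement, and uses a fixed penalty `lam`. All function names and defaults are hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, kept up to date after each update
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual (feature j excluded).
            z = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(z, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def selection_frequency(X, y, lam, n_resamples=20, frac=0.5, seed=0):
    """Fit the lasso on random subsamples and rank features by selection frequency."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    m = max(1, int(frac * n))
    for _ in range(n_resamples):
        idx = rng.sample(range(n), m)
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    freq = [c / n_resamples for c in counts]
    # Highest frequency first; ties broken by feature index.
    ranking = sorted(range(p), key=lambda j: (-freq[j], j))
    return freq, ranking


if __name__ == "__main__":
    rng = random.Random(42)
    X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(80)]
    # Only features 0 and 1 carry signal in this toy example.
    y = [3.0 * row[0] - 3.0 * row[1] + rng.gauss(0, 1) for row in X]
    freq, ranking = selection_frequency(X, y, lam=0.3, seed=1)
    print(ranking[:2], freq)
```

In practice the penalty would be tuned, for example by cross-validation, and the number of features kept would come from a screening threshold such as the b-SIS condition the excerpts mention.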

“…The FS methods are used to reduce the dimensionality of data by removing noisy and redundant features, which helps in selecting the truly important features. The FS methods are classified into rank-based and subset methods [12,13]. Rank-based methods rank all the features with respect to their importance based on some criteria.…”

Section: Introduction (mentioning)

confidence: 99%

“…In this article, we introduce the combination of an ensemble classifier with an FS method: the resampling-based lasso feature selection (RLFS) method for ranking features, and an ensemble of regularized regression models (ERRM) for classification purposes. The resampling approach was proven to be one of the best FS screening steps in a high-dimensional data setting [13]. The RLFS uses the selection probability with the lasso penalty, and the threshold for selecting the top-ranked features is set using the b-SIS condition; these selected features are then applied to the ERRM to achieve the best prediction accuracy.…”

Section: Introduction (mentioning)

confidence: 99%

Combination of Ensembles of Regularized Regression Models with Resampling-Based Lasso Feature Selection in High Dimensional Data

Patil

Kim

2020

Mathematics

Self Cite

In high-dimensional data, the performance of various classifiers depends largely on the selection of important features. Most individual classifiers combined with existing feature selection (FS) methods do not perform well on highly correlated data. Obtaining important features using an FS method and selecting the best-performing classifier is a challenging task in high-throughput data. In this article, we propose a combination of resampling-based least absolute shrinkage and selection operator (LASSO) feature selection (RLFS) and ensembles of regularized regression models (ERRM) capable of dealing with data that have high correlation structures. The ERRM boosts the prediction accuracy with the top-ranked features obtained from RLFS. The RLFS utilizes the lasso penalty with the sure independence screening (SIS) condition to select the top k ranked features. The ERRM includes five individual penalty-based classifiers: LASSO, adaptive LASSO (ALASSO), elastic net (ENET), smoothly clipped absolute deviations (SCAD), and minimax concave penalty (MCP). It was built on the idea of bagging and rank aggregation. Through simulation studies and an application to smokers’ cancer gene expression data, we demonstrated that the proposed combination of ERRM with RLFS achieved superior performance in accuracy and geometric mean.

“…Some of the existing studies for statistical comparison includes unsupervised clustering and normalization [20,21], supervised feature ranking and classification methods [22,23,19,24,25]. Feature ranking and classification have been extensively utilized in microarray gene expression studies [26,27]. The key difference of DNAm data from gene expression is that the DNAm has continuous variables ranging between 0 and 1.…”

Section: Introduction (mentioning)

confidence: 99%

Analyzing high dimensional correlated data using feature ranking and classifiers

Patil

Leung

Kim

2019

Computational and Mathematical Biophysics

Self Cite

The Illumina Infinium HumanMethylation27 (Illumina 27K) BeadChip assay is a relatively recent high-throughput technology that allows over 27,000 CpGs to be assayed. The Illumina 27K methylation data is less commonly used than gene expression data in bioinformatics. This creates a critical need to find the optimal feature ranking (FR) method for handling the high-dimensional data. The optimal FR method for a given classifier is not well known, and choosing the best-performing FR method becomes more challenging in a high-dimensional data setting. Therefore, identifying the statistical methods that boost the inference is of crucial importance in this context. This paper describes the detailed performance of FR methods such as Fisher score, information gain, chi-square, and minimum redundancy and maximum relevance on different classification methods such as AdaBoost, Random Forest, Naive Bayes, and Support Vector Machines. Through a simulation study and real data applications, we show that the Fisher score as an FR method, when applied with all the classifiers, achieved the best prediction accuracy with a significantly small number of ranked features.
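As a rough illustration of a filter ranking method of the kind this abstract names, the sketch below computes one common definition of the Fisher score: for each feature, the class-size-weighted squared distance of the class means from the overall mean, divided by the class-size-weighted within-class variance. The abstract does not give the exact variant used in the paper, so this formula and the function name `fisher_scores` are assumptions.

```python
def fisher_scores(X, y):
    """Return one Fisher score per feature (higher means more discriminative).

    Assumed definition: sum_c n_c * (mean_c - mean)^2 / sum_c n_c * var_c,
    with population (divide-by-n_c) variances within each class.
    """
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        overall = sum(col) / n
        between = 0.0
        within = 0.0
        for c in classes:
            vals = [v for v, label in zip(col, y) if label == c]
            mean_c = sum(vals) / len(vals)
            var_c = sum((v - mean_c) ** 2 for v in vals) / len(vals)
            between += len(vals) * (mean_c - overall) ** 2
            within += len(vals) * var_c
        if within > 0.0:
            scores.append(between / within)
        else:
            # Degenerate feature: no within-class spread.
            scores.append(float("inf") if between > 0.0 else 0.0)
    return scores


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, feature 1 does not.
    X = [[0.0, 1.0], [0.1, 5.0], [1.0, 1.2], [1.1, 4.8]]
    y = [0, 0, 1, 1]
    scores = fisher_scores(X, y)
    ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
    print(scores, ranking)
```

Ranking features by descending score and keeping the top few gives a simple filter step that can precede any classifier.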

PYE: A Penalized Youden Index Estimator for selecting and combining biomarkers in high-dimensional data

Salaroli

Pardo

2023

Chemometrics and Intelligent Laboratory Systems
