Stability Analysis of Feature Ranking Techniques on Biological Datasets

2014 IEEE International Conference on Bioinformatics and Bioengineering

Napolitano

2014

Self Cite

One of the more prevalent problems when working with bioinformatics datasets is class imbalance, where one class has more instances than the other class(es). This problem is made worse because the class of interest is frequently also the minority class. A possible solution is data sampling, a powerful tool for combating class imbalance by adding or removing instances to make the dataset more balanced. Beyond the choice of whether to include data sampling, one of the most important decisions when applying it is what the final class ratio should be. The final class ratio is commonly 50:50; however, it is an open question whether other ratios are more appropriate for certain imbalanced datasets (all datasets in this paper have 25.16% minority instances or less), where a 50:50 ratio requires extreme modification of the dataset. In this work we compare six different data sampling approaches (feature selection with the pairwise combinations of three data sampling techniques and two final class ratios) with feature selection without data sampling, with the goal of determining whether the inclusion of data sampling is beneficial and, if so, what the final class ratio should be. In order to test the six data sampling approaches and feature selection alone thoroughly, we utilize seven imbalanced and high-dimensional datasets, three feature selection techniques, and six classifiers. Our results show that for a majority of scenarios, random undersampling along with either 35:65 or 50:50 is the best data sampling approach. Statistical analysis shows that there is no significant difference between the data sampling approaches. Despite this, we still recommend using random undersampling along with 35:65 as the final class ratio, because random undersampling and 35:65 were most frequently the top-performing data sampling technique and class ratio, respectively. Additionally, 35:65 will have fewer negative impacts than 50:50 (less data loss or overfitting, which makes it a better choice if all other factors are equal), and random undersampling is more computationally efficient than any other form of sampling, including "no sampling" (both by not requiring any internal calculations and by producing a reduced, easier-to-work-with dataset). To our knowledge, this is the most comprehensive work that focuses on the choice of the inclusion and implementation of data sampling with different final class ratios on bioinformatics datasets which exhibit such large levels of class imbalance.
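
As a minimal sketch of what sampling to a final class ratio involves, the Python below randomly discards majority-class instances until the minority class makes up a requested share of the data (0.35 for 35:65, 0.50 for 50:50). The function name `random_undersample`, its parameters, and the rounding rule are illustrative assumptions and not the authors' implementation.

```python
import random


def random_undersample(labels, minority_label, minority_share=0.35, seed=0):
    """Return sorted indices kept after randomly dropping majority instances.

    minority_share is the desired fraction of minority instances in the
    result: 0.35 for a 35:65 ratio, 0.50 for 50:50 (hypothetical parameter).
    """
    if not 0 < minority_share < 1:
        raise ValueError("minority_share must be strictly between 0 and 1")
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Majority count that gives the requested minority share.
    target = round(len(minority) * (1 - minority_share) / minority_share)
    # Undersampling only removes instances, so never ask for more than exist.
    target = min(target, len(majority))
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, target)
    return sorted(minority + kept_majority)
```

With 7 minority and 93 majority instances, a 35:65 target keeps 13 majority instances while a 50:50 target keeps only 7, which illustrates why the less aggressive ratio discards less data.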

Section: B Feature Selection Techniques (mentioning)

confidence: 88%

Selecting the Appropriate Data Sampling Approach for Imbalanced and High-Dimensional Bioinformatics Datasets

2014 IEEE International Conference on Bioinformatics and Bioengineering

Napolitano

2014

Self Cite

“…We believe, it is important to observe how the choice between RUSBoost combined with external feature selection and SelectRUSBoost affect techniques with varying degrees of stability. According to previous research [7] we see that IG has average to below average stability; ROC is one of the most stable feature selection techniques; and S2N is above average in terms of stability.…”

Section: B Feature Selection Techniques (mentioning)

confidence: 86%

Contrasting Undersampled Boosting with Internal and External Feature Selection for Patient Response Datasets

2013 12th International Conference on Machine Learning and Applications

Wald et al.

2013

Self Cite

Class imbalance (where one class has many more instances than the other class(es)) and high dimensionality (a large number of features per instance) are two problems frequently present in patient response datasets. In addition, these datasets are notoriously difficult to build effective models from. This paper introduces a new hybrid boosting algorithm named SelectRUSBoost, which combines data sampling and feature selection with every iteration of boosting. We test SelectRUSBoost along with RUSBoost combined with external feature selection on a set of five patient response datasets. In addition to the datasets, we utilize two classifiers, three filter-based feature selection techniques, and four feature subset sizes. Our results show that SelectRUSBoost will, with few exceptions, outperform RUSBoost combined with external feature selection. Also, the feature selection technique information gain outperformed the other techniques for all combinations of boosting approach, classifier, and feature subset size; for this technique, SelectRUSBoost outperformed RUSBoost combined with external selection without exception. Statistical analysis confirmed that SelectRUSBoost gives better performance than RUSBoost combined with external selection. This is the first work which utilizes SelectRUSBoost in a bioinformatics study.
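
The abstract does not give the algorithm's details, so the following is only a rough sketch of the ordering it describes: within each boosting round, undersample the majority class, select features on the sampled data, then fit a weak learner on those features. The decision-stump learner, the mean-gap ranker, the 50:50 sampling, and the AdaBoost.M1-style weight update are all stand-in assumptions, and every function name here is hypothetical.

```python
import math
import random


def top_k_by_mean_gap(X, y, k):
    """Stand-in filter ranker (assumption): top-k features by |class-mean gap|."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]

    def gap(j):
        return abs(sum(r[j] for r in pos) / len(pos)
                   - sum(r[j] for r in neg) / len(neg))

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def stump_predict(stump, row):
    j, t, sign = stump
    return 1 if sign * (row[j] - t) > 0 else 0


def fit_stump(X, y, w, features):
    """Weighted decision stump restricted to the selected features."""
    best_err, best_stump = None, None
    for j in features:
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                stump = (j, t, sign)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if best_err is None or err < best_err:
                    best_err, best_stump = err, stump
    return best_stump


def select_rus_boost_sketch(X, y, rounds=5, k=2, seed=0):
    """Labels: 1 = minority class, 0 = majority class."""
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n
    minority = [i for i in range(n) if y[i] == 1]
    majority = [i for i in range(n) if y[i] == 0]
    ensemble = []
    for _ in range(rounds):
        # 1) Random undersampling of the majority class (50:50 in this sketch).
        idx = minority + rng.sample(majority, min(len(minority), len(majority)))
        Xs, ys, ws = [X[i] for i in idx], [y[i] for i in idx], [w[i] for i in idx]
        # 2) Internal feature selection: rank on the sampled data only.
        feats = top_k_by_mean_gap(Xs, ys, k)
        # 3) Weak learner trained on the sampled data and selected features.
        stump = fit_stump(Xs, ys, ws, feats)
        # 4) Reweight all training instances (AdaBoost.M1-style, an assumption).
        miss = [stump_predict(stump, X[i]) != y[i] for i in range(n)]
        err = sum(wi for wi, m in zip(w, miss) if m)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        w = [wi * math.exp(alpha if m else -alpha) for wi, m in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, stump))
    return ensemble


def predict(ensemble, row):
    score = sum(a * (1 if stump_predict(s, row) == 1 else -1)
                for a, s in ensemble)
    return 1 if score > 0 else 0
```

In the external alternative the abstract compares against, step 2 would instead run once on the full training set before boosting begins, so every round would see the same feature subset.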

“…All of these learners are available with the Weka machine learning toolkit [9]. Due to space considerations we cannot elaborate further on each dataset; refer to the work of Dittman et al [10] for more information on the datasets in Table I.…”

Section: Methods (mentioning)

confidence: 99%

“…Based on the training data, a logistic regression model is created which is used to decide the class membership of future instances [10].…”

Section: Classifiers (mentioning)

confidence: 99%

First Order Statistics Based Feature Selection: A Diverse and Powerful Family of Feature Selection Techniques

2012 11th International Conference on Machine Learning and Applications

Wald et al.

2012

Self Cite

Dimensionality reduction techniques have become a required step when working with bioinformatics datasets. Techniques such as feature selection have been known not only to improve computation time, but also to improve the results of experiments by removing redundant and irrelevant features or genes from consideration in subsequent analysis. Univariate feature selection techniques in particular are well suited to the high dimensionality inherent in bioinformatics datasets (for example, DNA microarray datasets) due to their intuitive output (a ranked list of features or genes) and their relatively small computational time compared to other techniques. This paper presents seven univariate feature selection techniques and collects them into a single family entitled First Order Statistics (FOS) based feature selection. These seven all share the trait of using first order statistical measures such as mean and standard deviation, although this is the first work to relate them to one another and compare their performance. In order to examine the properties of these seven techniques, we performed a series of similarity and classification experiments on eleven DNA microarray datasets. Our results show that, in general, each feature selection technique will create diverse feature subsets when compared to the other members of the family. However, when we look at classification we find that, with one exception, the techniques produce good classification results and perform similarly to each other. Our recommendation is to use the rankers Signal-to-Noise and SAM for the best classification results and to avoid Fold Change Ratio, as it is consistently the worst performer of the seven rankers.
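
To illustrate how a univariate ranker built from means and standard deviations produces a ranked list, here is a small sketch of a Signal-to-Noise ranker. The abstract does not state the formula; the code assumes the commonly used definition (difference of the class means divided by the sum of the class standard deviations), and the function names and tie-breaking rule are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(pos_values, neg_values):
    """S2N score for one feature, under the assumed definition:
    (mean_pos - mean_neg) / (std_pos + std_neg)."""
    denom = stdev(pos_values) + stdev(neg_values)
    if denom == 0:
        return 0.0  # constant feature in both classes: treated as uninformative
    return (mean(pos_values) - mean(neg_values)) / denom


def rank_features(X, y, top_k):
    """Score each feature independently and return the top_k feature indices."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scored = []
    for j in range(len(X[0])):
        score = signal_to_noise([r[j] for r in pos], [r[j] for r in neg])
        scored.append((abs(score), j))
    # Largest absolute score first; ties broken by feature index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scored[:top_k]]
```

Because each feature is scored on its own, the cost grows linearly with the number of features, which is consistent with the small computational time the abstract attributes to univariate techniques. Each class needs at least two instances for `stdev` to be defined.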