Many problems in bioinformatics involve high-dimensional collections of data that are difficult to process. For example, gene microarrays can record the expression levels of thousands of genes, many of which have no relevance to the underlying medical or biological question. Building classification models on such datasets can therefore take excessive computational time and still give poor results. Many strategies exist to combat these problems, including feature selection (which chooses only the most relevant genes for building models) and ensemble learning (which combines multiple weak classification learners into a single model that offers a broader view of the data). However, these techniques present a new challenge: choosing which combination of strategies is most appropriate for a given collection of data. This choice is especially difficult for health informatics and bioinformatics practitioners who do not have an extensive machine learning background. An ideal model should be easy to use and apply, helping the practitioner either by making these choices in advance or by being insensitive to them. In this work we demonstrate that the Random Forest learner, when using 100 trees and 200 features (selected by any reasonable feature ranking technique, as the specific choice does not matter), is such a model. To show this, we use 25 bioinformatics datasets drawn from a number of different cancer diagnosis and identification problems, and we compare Random Forest with 5 other learners. We also test 25 feature ranking techniques and 12 feature subset sizes to optimize the feature selection step. Our results show that Random Forest with 100 trees and 200 selected features is statistically significantly better than the alternatives (or, for the choice of 200 features, statistically equivalent to the top choices), and that the specific choice of feature ranking technique has no statistically significant effect.
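The recommended configuration can be expressed concisely as a feature-ranking step followed by a Random Forest. The following is a minimal sketch, assuming scikit-learn; the synthetic dataset, the ANOVA F-score ranking, and the cross-validation setup are illustrative stand-ins for the paper's 25 datasets, 25 ranking techniques, and evaluation protocol, not reproductions of them.

```python
# A sketch of "100 trees, 200 selected features" using scikit-learn.
# Synthetic data stands in for a gene microarray dataset here.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in for microarray data: many features, few of them informative.
X, y = make_classification(n_samples=200, n_features=5000,
                           n_informative=50, random_state=0)

pipeline = Pipeline([
    # Rank features and keep the top 200. The ANOVA F-score is one
    # reasonable ranking technique; per the results above, the specific
    # choice of ranker should not matter much.
    ("rank", SelectKBest(score_func=f_classif, k=200)),
    # Random Forest with 100 trees, the recommended configuration.
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Placing selection inside the pipeline ensures features are ranked on
# each training fold only, avoiding selection bias in the estimate.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```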