Univariate feature rankers are frequently used to order genes (features) by their importance to a given bioinformatics problem. Unfortunately, the resulting feature subsets tend to differ when the rankers are applied to related (but distinct) datasets, or to datasets which have been varied or corrupted in some fashion. As a result, recent research has focused on methods to measure or improve the stability of these feature subsets. One such method is rank aggregation: the process of combining the information from several ranked lists (in this case, ordered gene lists) into a single, more stable list. While there has been work on creating these methods, very little work has gone into comparing the lists they generate. Such a comparison allows the techniques to be grouped into families, both to understand how the families affect rank aggregation and to identify less computationally expensive members of a given family. This paper presents an extensive study of nine rank aggregation techniques across twenty-six bioinformatics datasets. Our results show that certain aggregation techniques are very similar to each other, while others are dissimilar to all of the other techniques. Additionally, we found that as the size of the feature subset increases, the similarity between the techniques increases. To our knowledge, this is the first study to examine this many rank aggregation techniques within the domain of bioinformatics.
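To make the idea of rank aggregation concrete, the following is a minimal sketch of one simple scheme, mean-position aggregation. The abstract does not name the nine techniques studied, so this should be read as an illustration only, not as one of the paper's methods. The function name `mean_rank_aggregate` and the gene labels are hypothetical, and the sketch assumes every input list ranks the same set of genes.

```python
from statistics import mean


def mean_rank_aggregate(ranked_lists):
    """Combine several ranked gene lists into one list by average position.

    Illustrative only: assumes every list ranks exactly the same genes,
    with index 0 holding the most important gene.
    """
    genes = set(ranked_lists[0])
    for lst in ranked_lists:
        if set(lst) != genes:
            raise ValueError("all lists must rank the same genes")
    # Average position of each gene across all input lists.
    avg_pos = {g: mean(lst.index(g) for lst in ranked_lists) for g in genes}
    # Lower mean position means higher aggregated rank; ties break by name.
    return sorted(genes, key=lambda g: (avg_pos[g], g))


# Hypothetical rankings of four genes, e.g. from three different rankers.
lists = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g3", "g4"],
    ["g1", "g3", "g2", "g4"],
]
aggregated = mean_rank_aggregate(lists)
print(aggregated)
```

Here the three input lists disagree on the order of the first three genes, and the aggregated list settles the disagreement by averaging positions.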
Dimensionality reduction techniques have become a required step when working with bioinformatics datasets. Techniques such as feature selection are known not only to reduce computation time, but also to improve experimental results by removing redundant and irrelevant features (genes) from consideration in subsequent analysis. Univariate feature selection techniques in particular are well suited to the high dimensionality inherent in bioinformatics datasets (for example, DNA microarray datasets) because of their intuitive output (a ranked list of features or genes) and their relatively small computational cost compared to other techniques. This paper presents seven univariate feature selection techniques and collects them into a single family entitled First Order Statistics (FOS) based feature selection. All seven use first order statistical measures such as the mean and standard deviation, but this is the first work to relate them to one another and to compare their performance. To examine the properties of these seven techniques, we performed a series of similarity and classification experiments on eleven DNA microarray datasets. Our results show that, in general, each feature selection technique creates feature subsets that differ from those of the other members of the family. However, when we look at classification, we find that, with one exception, the techniques produce good classification results and perform similarly to each other. Our recommendation is to use the rankers Signal-to-Noise and SAM for the best classification results and to avoid Fold Change Ratio, as it is consistently the worst performer of the seven rankers.
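As an illustration of how a first order statistics ranker works, the sketch below scores each gene with a signal-to-noise ratio and sorts the genes by the absolute value of that score. The abstract does not give the formula the authors use, so the code assumes one commonly used definition: the difference of the two class means divided by the sum of the two class standard deviations. The function names, the toy expression values, and the binary labels are all hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(pos, neg):
    """Assumed definition: (mean_pos - mean_neg) / (sd_pos + sd_neg)."""
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def rank_genes(expr, labels, score=signal_to_noise):
    """Rank genes from most to least discriminative by absolute score.

    expr maps a gene name to its expression values, one per sample;
    labels holds 1 (positive class) or 0 (negative class) per sample.
    """
    scores = {}
    for gene, values in expr.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = score(pos, neg)
    # Larger absolute score means stronger separation between classes.
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# Hypothetical expression values for three genes over six samples.
expr = {
    "geneA": [5.0, 6.0, 5.5, 1.0, 1.5, 0.5],
    "geneB": [2.0, 3.0, 2.5, 2.2, 2.8, 2.4],
    "geneC": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
}
labels = [1, 1, 1, 0, 0, 0]
ranking = rank_genes(expr, labels)
print(ranking)
```

Because the scoring function is passed in as a parameter, another first order statistic could be substituted to compare the gene orderings that different rankers produce.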