As a commonly used technique in data preprocessing, feature selection selects a subset of informative attributes or variables to build models describing data. By removing redundant, irrelevant, or noisy features, feature selection can improve the predictive accuracy and the comprehensibility of predictors or classifiers. Many feature selection algorithms with different selection criteria have been introduced by researchers. However, it has been found that no single criterion is best for all applications. In this paper, we propose a framework based on a genetic algorithm (GA) for feature subset selection that combines various existing feature selection methods. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the particular inductive learning algorithm used to build the classifier. We conducted experiments using three data sets and three existing feature selection methods. The experimental results demonstrate that our approach is a robust and effective way to find subsets of features with higher classification accuracy and/or smaller size compared to each individual feature selection algorithm.
Abstract. Because microarray gene expression data contain a huge number of genes and a comparatively small number of samples, accurate classification of diseases is challenging. Feature selection techniques can improve classification accuracy by removing irrelevant and redundant genes. However, the performance of feature selection algorithms based on different theoretical arguments varies even when they are applied to the same data set. In this paper, we propose a hybrid approach that combines useful outcomes from different feature selection methods through a genetic algorithm. The experimental results demonstrate that our approach can achieve better classification accuracy with a smaller gene subset than each individual feature selection algorithm does.
Microarray data usually contain a huge number of genes (features) and a comparatively small number of samples, which makes accurate classification or prediction of diseases challenging. Feature selection techniques can help distinguish important features from irrelevant (unimportant) ones by applying certain selection criteria. However, different feature selection algorithms based on various theoretical arguments often produce different results when applied to the same data set. This makes it difficult to select an optimal or near-optimal feature subset for a data set. In this paper, we propose using a genetic algorithm to improve feature subset selection by combining valuable outcomes from multiple feature selection methods. The goal of our genetic algorithm is to achieve a balance between the classification accuracy and the size of the selected feature subsets. The advantages of this approach include the ability to accommodate different feature selection criteria and to find small subsets of features that perform well for the particular inductive learning algorithm used to build the classifier. The experimental results demonstrate that our approach can find subsets of features with higher classification accuracy and/or smaller size compared with each individual feature selection algorithm.
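The idea of balancing accuracy against subset size can be illustrated with a minimal sketch. This is not the authors' implementation: the fitness formula, the `size_weight` parameter, the tournament selection, the one-point crossover, and the toy accuracy function are all assumptions made for illustration. The only elements taken from the abstract are that the genetic algorithm starts from the outcomes of several feature selection methods and trades classification accuracy against the number of selected features.

```python
import random
from typing import Callable, List, Optional

Mask = List[int]  # 1 = feature selected, 0 = feature dropped


def fitness(mask: Mask, accuracy_fn: Callable[[Mask], float],
            size_weight: float = 0.2) -> float:
    """Hypothetical fitness: reward accuracy, plus a bonus for small subsets."""
    n_selected = sum(mask)
    if n_selected == 0:
        return 0.0  # an empty subset cannot be used to build a classifier
    return accuracy_fn(mask) + size_weight * (1 - n_selected / len(mask))


def run_ga(seed_masks: List[Mask], accuracy_fn: Callable[[Mask], float],
           generations: int = 30, pop_size: int = 20,
           mutation_rate: float = 0.05, size_weight: float = 0.2,
           rng: Optional[random.Random] = None) -> Mask:
    """Evolve feature masks, seeded with subsets chosen by existing methods."""
    rng = rng or random.Random(0)
    n = len(seed_masks[0])

    def score(m: Mask) -> float:
        return fitness(m, accuracy_fn, size_weight)

    # Start from the seed subsets and fill the rest with random masks.
    population = [list(m) for m in seed_masks]
    while len(population) < pop_size:
        population.append([rng.randint(0, 1) for _ in range(n)])

    best = max(population, key=score)
    for _ in range(generations):
        next_pop = [list(best)]  # elitism: the best mask always survives
        while len(next_pop) < pop_size:
            # Binary tournament selection for each parent.
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            cut = rng.randrange(1, n)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            next_pop.append(child)
        population = next_pop
        best = max(population, key=score)
    return best


# Toy stand-in for classifier accuracy: features 0 and 3 are "informative".
INFORMATIVE = {0, 3}


def toy_accuracy(mask: Mask) -> float:
    return sum(mask[i] for i in INFORMATIVE) / len(INFORMATIVE)


# Pretend three feature selection methods each proposed one subset.
seeds = [[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 1, 1, 0]]
best = run_ga(seeds, toy_accuracy, rng=random.Random(1))
print(best, round(fitness(best, toy_accuracy), 3))
```

In practice `accuracy_fn` would wrap cross-validated accuracy of the chosen learning algorithm on the masked data; because of elitism, the returned mask never scores worse than the best seed subset under this fitness.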
Building a system from a variety of disparate individual components or programs is usually a challenging task. The components are not designed to communicate with each other, yet constructing the whole system requires seamless collaboration among them. In this paper, we present a pluggable application server framework targeted at protein structure prediction. The framework is capable of combining various existing programs into an efficient unit, and the design aims to provide a model that can integrate heterogeneous components into the system quickly without modifying their code. Based on the model, different components can be plugged into the system with simple configuration, which leads to a self-configurable and adaptive system. A protein structure prediction server was developed by applying the design model, and the implementation emphasizes the efficiency and simplicity of system construction. The method and model are generic and can be applied to the design of other systems as well.
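One common way to plug unmodified programs into a larger system is to wrap each one behind a uniform interface and chain them by configuration. The sketch below illustrates that general pattern only; the `Component` and `Pipeline` names, the stdin/stdout convention, and the two stand-in programs are hypothetical and are not taken from the framework described above.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Component:
    """An external program, described by configuration only (hypothetical)."""
    name: str
    command: List[str]  # argv; input is sent on stdin, output read from stdout


class Pipeline:
    """Registry that chains plugged-in programs without changing their code."""

    def __init__(self) -> None:
        self._components: Dict[str, Component] = {}

    def plug(self, component: Component) -> None:
        self._components[component.name] = component

    def run(self, order: List[str], data: str) -> str:
        # Feed each program's stdout into the next program's stdin.
        for name in order:
            comp = self._components[name]
            result = subprocess.run(comp.command, input=data,
                                    capture_output=True, text=True, check=True)
            data = result.stdout
        return data


# Two stand-in "programs", run through the current Python interpreter.
pipeline = Pipeline()
pipeline.plug(Component("upper", [
    sys.executable, "-c",
    "import sys; print(sys.stdin.read().strip().upper())"]))
pipeline.plug(Component("reverse", [
    sys.executable, "-c",
    "import sys; print(sys.stdin.read().strip()[::-1])"]))

output = pipeline.run(["upper", "reverse"], "acgu").strip()
print(output)
```

Adding a new tool then amounts to registering another `Component` entry; the order of execution is data passed to `run`, not code in the wrapped programs.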
RNA plays a critical role in mediating every step of cellular information transfer from genes to functional proteins. Pseudoknots are functionally important and widely occurring structural motifs found in all types of RNA. Therefore, predicting their structures is an important problem. In this paper, we present a new RNA pseudoknot structure prediction method based on term rewriting. The method is implemented using the Mfold RNA/DNA folding package and the term rewriting language Maude. In our method, RNA structures are treated as terms, and rules are discovered for predicting pseudoknots. Our method was tested on 211 pseudoknots in PseudoBase and achieved an average accuracy of 74.085% relative to the experimentally determined structures. In fact, most pseudoknots predicted by our method achieve an accuracy above 90%. These results indicate that term rewriting has broad potential in RNA applications, ranging from prediction of pseudoknots to discovery of higher-level RNA structures involving complex RNA tertiary interactions.