Background
In the clinical context, samples assayed by microarray are often classified by cell line or tumour type, and it is of interest to discover a set of genes that can be used as class predictors. The leukemia dataset of Golub et al. [1] and the NCI60 dataset of Ross et al. [2] present multiclass classification problems in which three tumour types and nine cell lines, respectively, must be identified. We apply an evolutionary algorithm to identify a near-optimal set of predictive genes that classify the data. We also examine the initial gene selection step, in which the most informative genes are selected from the genes assayed.

Results
In the absence of feature selection, classification accuracy on the training data is typically good but is not replicated on the testing data. Gene selection using the RankGene software [3] is shown to significantly improve performance on the testing data. Further, we show that the choice of feature selection criterion can have a significant effect on accuracy. The evolutionary algorithm is shown to perform stably across the space of possible parameter settings, indicating the robustness of the approach. We assess performance using a low-variance estimation technique, and present an analysis of the genes most often selected as predictors.

Conclusion
The computational methods we have developed perform robustly and accurately, and yield results in accord with clinical knowledge: a Z-score analysis of the genes most frequently selected identifies genes known to discriminate AML and Pre-T ALL leukemia. This study also confirms that significantly different sets of genes are found to be most discriminatory as the sample classes are refined.
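The abstract above does not say how the Z-score of a gene's selection frequency is computed. As a minimal sketch only, assuming a null model in which every run picks its predictor genes uniformly at random, a gene's selection count over the runs can be compared with a binomial mean and standard deviation. The function name `selection_z_scores`, the gene labels, and the null model itself are illustrative assumptions, not the authors' procedure.

```python
import math
from collections import Counter


def selection_z_scores(selected_sets, n_genes):
    """Z-score of each gene's selection count under a uniform-random null.

    ASSUMPTION (not from the paper): under the null, each run selects its
    genes uniformly at random from the n_genes candidates, so a gene's
    count over the runs is Binomial(runs, p).
    """
    runs = len(selected_sets)
    # Count in how many runs each gene was selected (once per run at most).
    counts = Counter(g for s in selected_sets for g in set(s))
    # Per-run selection probability: mean subset size divided by n_genes.
    p = sum(len(set(s)) for s in selected_sets) / (runs * n_genes)
    mean = runs * p
    sd = math.sqrt(runs * p * (1 - p))
    if sd == 0:
        raise ValueError("degenerate null: no variance in selection counts")
    return {g: (c - mean) / sd for g, c in counts.items()}


# Hypothetical predictor sets from four runs of a gene-selection procedure.
example_runs = [["g1", "g2"], ["g1", "g3"], ["g1", "g4"], ["g1", "g2"]]
z = selection_z_scores(example_runs, n_genes=100)
```

Genes that are selected in many more runs than the null predicts receive large positive scores; a different null model would give different values.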
Using a method of selecting genes on the basis of their utility for classification [2], we apply optimal gene network inference to the 24 most highly-ranked genes in a leukemia data set [1]. In order to have confidence in the resulting Bayesian gene networks, we first validate the network inference methodology on synthetic data and establish that the methodology has very high specificity, i.e. if an edge is inferred then it is highly likely to be correct. However, we are unable to confidently predict directed edges in the network.

Microarray data analysis poses a number of challenges arising from the high dimensionality of the data, the small number of samples, and sample noise. Consequently, significant methodological questions arise. Statistical techniques can identify correlations between the expression levels of genes, while evolutionary computational techniques can be used to learn classifiers that accurately distinguish categories such as AML and ALL (tumour types) in leukemia data. The genes of most use in classifying samples can be identified in this way, but the relationships between them are not uncovered. To find these relationships, we apply Bayesian network inference.

The network inference methodology we present is based on the optimal network search algorithm proposed by Ott [3], which is applied in a resampling framework. ROC analysis of networks recovered from synthetic data provides a measure of the performance of this approach. Having selected a small number of genes from the 7070 assayed in the microarray experiment, we are able to perform network inference having solved the feature selection problem. The class labels inform our analysis of the resulting networks. We show that distinct sub-networks associated with AML and with T-cell responses emerge. Evaluation of the biological plausibility of the results is ongoing.
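The validation on synthetic data described above can be illustrated by scoring an inferred edge set against a known one. The following is a minimal sketch, not the authors' evaluation code: the function `edge_metrics` and the toy four-gene network are hypothetical, and edges are treated as undirected because the abstract reports that edge direction could not be predicted confidently. Precision is reported alongside specificity because the abstract's gloss ("if an edge is inferred then it is highly likely to be correct") corresponds to precision in the usual terminology.

```python
from itertools import combinations


def edge_metrics(true_edges, inferred_edges, nodes):
    """Compare inferred undirected edges with the known synthetic network."""
    # frozenset makes (a, b) and (b, a) the same undirected edge.
    true_set = {frozenset(e) for e in true_edges}
    inferred_set = {frozenset(e) for e in inferred_edges}
    all_pairs = {frozenset(p) for p in combinations(nodes, 2)}

    tp = len(true_set & inferred_set)              # correctly inferred edges
    fp = len(inferred_set - true_set)              # inferred but absent
    fn = len(true_set - inferred_set)              # present but missed
    tn = len(all_pairs - true_set - inferred_set)  # correctly left out

    def ratio(num, den):
        # Return None when a metric is undefined (empty denominator).
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Hypothetical four-gene synthetic network and a hypothetical inferred network.
nodes = ["A", "B", "C", "D"]
true_edges = [("A", "B"), ("B", "C"), ("C", "D")]
inferred_edges = [("B", "A"), ("C", "D"), ("A", "D")]
metrics = edge_metrics(true_edges, inferred_edges, nodes)
```

In a resampling framework, each candidate edge would typically carry a score such as the fraction of resampled networks containing it; sweeping a threshold over that score and recomputing these metrics at each value traces out a ROC curve.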