In this study, we introduce and use Efficiency Analysis to compare differences in the apparent internal and external consistency of competing normalization methods and tests for identifying differentially expressed genes. Using publicly available data, two lung adenocarcinoma datasets were analyzed using caGEDA (http://bioinformatics2.pitt.edu/GE2/GEDA.html) to measure the degree of differential expression of genes existing between two populations. The datasets were randomly split into at least two subsets, each analyzed for differentially expressed genes between the two sample groups, and the gene lists compared for overlapping genes. Efficiency Analysis is an intuitive method that compares the differences in the percentage of overlap of genes from two or more data subsets, found by the same test over a range of testing methods. Tests that yield consistent gene lists across independently analyzed splits are preferred to those that yield less consistent inferences. For example, a method that exhibits 50% overlap in the 100 top genes from two studies should be preferred to a method that exhibits 5% overlap in the top 100 genes. The same procedure was performed using all available normalization and transformation methods that are available through caGEDA. The ‘best’ test was then further evaluated using internal cross-validation to estimate generalizable sample classification errors using a Naïve Bayes classification algorithm. A novel test, termed D1 (a derivative of the J5 test) was found to be the most consistent, and to exhibit the lowest overall classification error, and highest sensitivity and specificity. The D1 test relaxes the assumption that few genes are differentially expressed. Efficiency Analysis can be misleading if the tests exhibit a bias in any particular dimension (e.g. expression intensity); we therefore explored intensity-scaled and segmented J5 tests using data in which all genes are scaled to share the same intensity distribution range. Efficiency Analysis correctly predicted the ‘best’ test and normalization method using the Beer dataset and also performed well with the Bhattacharjee dataset based on both efficiency and classification accuracy criteria.
The explosion of microarray data from pilot studies, basic research and large-scale clinical trials requires the development of integrative computational tools that can not only analyse gene expression patterns but that can also evaluate the methods of analysis adopted and then provide a boost to post-analysis translational interpretation of those patterns. We have developed a web application called caGEDA (cancer gene expression data analyzer) that can: (1) upload gene expression profiles from cDNA or oligonucleotide microarrays; (2) conduct a diverse range of serial linear normalisations; (3) identify differentially expressed genes using a variety of tests - either threshold or permutation tests; (4) produce tables of literature references to papers reporting that specific genes (identified by accession numbers) are up- or down-regulated in specific cancers; (5) estimate the error of sample class prediction using the significant gene set for features; (6) perform low-bias and accurate validated learning using three computational validation techniques (leave-one out validation, k-fold validation, random re-sampling validation); and (7) validate a classifier with a randomly selected or user-defined validation set. Significant genes are reported in a table of links to entries in the following databases: Locus Link, Genome View, UCSC, Ensembl, UniGene, dbSNP, AmiGO and OMIM. caGEDA is seamlessly integrated via embedded forms with UCSD's (University of California at San Diego) 2HAPI server (for medical subject heading (MeSH) term exploration) and EZ-Retrieve (to identify common transcription factors located upstream of sets of genes that exhibit similar modes of differential expression). caGEDA offers a variety of previously described and novel tests for differentially expressed genes, most notably the permutation percentile separability test, which is most appropriate for identifying genes that are significantly differentially expressed in a subset of patients. caGEDA, which is open source and free to academic users, will soon be greatly enhanced by operating with the components of the National Cancer Institute's new cancer bioinformatics grid (caBIG).
Background: Microarray studies in cancer compare expression levels between two or more sample groups on thousands of genes. Data analysis follows a population-level approach (e.g., comparison of sample means) to identify differentially expressed genes. This leads to the discovery of 'population-level' markers, i.e., genes with the expression patterns A > B and B > A. We introduce the PPST test that identifies genes where a significantly large subset of cases exhibit expression values beyond upper and lower thresholds observed in the control samples.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.