Background: Microarray data analysis is notorious for involving a huge number of genes compared to a relatively small number of samples. Gene selection aims to detect the most significantly differentially expressed genes under different conditions, and it has been a central research focus. In general, a better gene selection method can improve classification performance significantly. One of the difficulties in gene selection is that the numbers of samples under different conditions can vary widely. Results: Two novel gene selection methods are proposed in this paper, which are not affected by unbalanced sample class sizes and do not assume any explicit statistical model on the gene expression values. They were evaluated on eight publicly available microarray datasets, using leave-one-out cross-validation and 5-fold cross-validation. Performance is measured by the classification accuracies obtained using the top-ranked genes selected on the training datasets. Conclusion: The experimental results showed that the proposed gene selection methods are efficient, effective, and robust in identifying differentially expressed genes. With the existing SVM-based and KNN-based classifiers, the genes selected by our proposed methods in general give more accurate classification results, typically when the sample class sizes in the training dataset are unbalanced.
Background: Recently, several statistical methods have been proposed to identify genes with differential expression between two conditions. However, very few studies consider the problem of sample imbalance, and no study has investigated the impact of sample imbalance on identifying differentially expressed genes. In addition, it is not clear which method is more suitable for unbalanced data.
Granger causality (GC) has been widely applied in economics and neuroscience to reveal causal influence between time series. In our previous paper (Hu et al., in IEEE Trans on Neural Netw, 22(6), pp. 829-844, 2011), we proposed new causalities in the time and frequency domains and focused particularly on new causality in the frequency domain, pointing out the shortcomings/limitations of GC or Granger-alike causality metrics and the advantages of new causality. In this paper we continue our previous discussion and focus on new causality and GC or Granger-alike causality metrics in the time domain. Although one strong motivation was introduced in our previous paper (Hu et al., in IEEE Trans on Neural Netw, 22(6), pp. 829-844, 2011), we present here an additional motivation for the proposed new causality metric and restate the previous motivation for completeness. We point out one property of conditional GC in the time domain and the shortcomings/limitations of conditional GC, which cannot reveal the real strength of the directional causality among three time series. We also show the shortcomings/limitations of directed causality (DC) or normalized DC for multivariate time series and demonstrate that it cannot reveal real causality at all. By calculating GC and new causality values for an example, we demonstrate that the influence of one time series on the other increases linearly as the coupling strength increases linearly. This fact further supports the reasonableness of the new causality metric. We point out that larger instantaneous correlation does not necessarily mean larger true causality (e.g., GC and new causality), or vice versa. Finally, we analyze the statistical test for significance and the asymptotic distribution of the new causality metric using illustrative examples.
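As a rough illustration of the time-domain GC mentioned in this abstract, the sketch below computes the standard Granger measure, the log ratio of residual variances between a model that predicts x from its own past and one that also uses the past of y. It is only a minimal sketch under stated assumptions: the lag-1 model, the coefficients 0.3 and 0.5, and the helper names `residual_variance`, `granger_y_to_x`, and `simulate` are all hypothetical choices made here. It does not implement the authors' new causality metric, which the abstract does not define.

```python
import math
import random


def residual_variance(target, regressors):
    """Least-squares residual variance with one or two regressors (no intercept)."""
    n = len(target)
    if len(regressors) == 1:
        (u,) = regressors
        a = sum(t * p for t, p in zip(target, u)) / sum(p * p for p in u)
        return sum((t - a * p) ** 2 for t, p in zip(target, u)) / n
    u, v = regressors
    # Solve the 2x2 normal equations with Cramer's rule.
    suu = sum(p * p for p in u)
    svv = sum(q * q for q in v)
    suv = sum(p * q for p, q in zip(u, v))
    sut = sum(p * t for p, t in zip(u, target))
    svt = sum(q * t for q, t in zip(v, target))
    det = suu * svv - suv * suv
    a = (sut * svv - svt * suv) / det
    b = (svt * suu - sut * suv) / det
    return sum((t - a * p - b * q) ** 2 for t, p, q in zip(target, u, v)) / n


def granger_y_to_x(x, y):
    """Standard lag-1 Granger causality from y to x: ln(var_restricted / var_full)."""
    target, x_lag, y_lag = x[1:], x[:-1], y[:-1]
    restricted = residual_variance(target, [x_lag])
    full = residual_variance(target, [x_lag, y_lag])
    return math.log(restricted / full)


def simulate(coupling, n=5000, seed=0):
    """Toy system (assumed for illustration): y drives x with the given coupling."""
    rng = random.Random(seed)
    x, y = [0.0], [0.0]
    for _ in range(n):
        x_next = 0.3 * x[-1] + coupling * y[-1] + rng.gauss(0, 1)
        y_next = 0.5 * y[-1] + rng.gauss(0, 1)
        x.append(x_next)
        y.append(y_next)
    return x, y


gc_none = granger_y_to_x(*simulate(0.0))
gc_strong = granger_y_to_x(*simulate(0.8))
```

With no coupling the measure should stay near zero, and it should grow as the coupling grows; how it scales with coupling strength is exactly the kind of question the abstract raises.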
In [1] we proposed two methods to identify the reference electrode signal under the key assumption that the reference signal is independent of the EEG sources. This assumption is shown to be possibly true for intracranial EEG with a scalp reference. In this paper, we prove theoretically that the reference signal obtained using the second method in [1] or the equivalent MPDR approach [2] outperforms the widely used average reference (AR) if the real reference is independent of the EEG sources. The simulation results confirm the advantages over AR.
Cross-validation is probably the most popular approach for estimating the classification error rate when classifying gene expression data. To reduce the variance of the estimate, the cross-validation procedure is repeated and the results are averaged. However, the number of repetitions is generally set to an empirical value. This paper proposes two methods (FCI and TSE) for determining the number of cross-validation repetitions based on an approximate confidence interval. The experimental results on real data show that setting the number of repetitions empirically is usually unreliable, and that the proposed methods can determine the number of repetitions needed to achieve a pre-specified precision of the error rate. Furthermore, both methods automatically adjust to changes in the data, the value of k in k-fold cross-validation, and the classification model.
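The general idea of choosing the repetition count from a confidence interval can be sketched as follows. This is not the FCI or TSE procedure, which the abstract does not specify; it is a generic sequential rule that keeps repeating cross-validation until a normal-approximation interval around the mean error is narrower than a target half-width. The names `repeat_until_precise`, `run_cv_once`, and `fake_cv`, as well as the default parameters, are assumptions made for illustration.

```python
import math
import random
import statistics


def repeat_until_precise(run_cv_once, half_width, z=1.96,
                         min_repeats=5, max_repeats=1000):
    """Repeat CV until the approximate CI half-width of the mean error is small enough.

    run_cv_once: hypothetical callable returning one k-fold CV error estimate.
    half_width:  target half-width of the confidence interval.
    """
    errors = []
    while len(errors) < max_repeats:
        errors.append(run_cv_once())
        n = len(errors)
        if n < min_repeats:
            continue
        # Normal-approximation interval: mean +/- z * s / sqrt(n).
        current = z * statistics.stdev(errors) / math.sqrt(n)
        if current <= half_width:
            break
    return statistics.mean(errors), len(errors)


# Stand-in for a real cross-validation run: a noisy error rate around 0.20.
rng = random.Random(0)


def fake_cv():
    return min(1.0, max(0.0, rng.gauss(0.20, 0.03)))


mean_error, repeats = repeat_until_precise(fake_cv, half_width=0.01)
```

In practice `run_cv_once` would reshuffle the data and rerun k-fold cross-validation with the chosen classifier, so the stopping point adapts to the dataset, the value of k, and the model, which is the behavior the abstract describes.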