Background: Beta diversity, the assessment of differences between communities, is an important problem in ecological studies. Many statistical methods have been developed to quantify beta diversity, and among them UniFrac and weighted-UniFrac (W-UniFrac) are widely used. W-UniFrac is a weighted sum of branch lengths in a phylogenetic tree of the sequences from the communities. However, W-UniFrac does not account for the variation of the weights under random sampling, which reduces its power to detect differences between communities.
Results: We develop a new statistic, termed variance adjusted weighted UniFrac (VAW-UniFrac), to compare two communities based on the phylogenetic relationships of the individuals. VAW-UniFrac is used to test whether the two communities are different. To assess its power, we first ran a series of simulations, which revealed that it always outperforms W-UniFrac, and also outperforms UniFrac when the individuals are not uniformly distributed. Next, all three methods were applied to three large 16S rRNA sequence collections (human skin bacteria, mouse gut microbial communities, and microbial communities from hypersaline soil and sediments) and to a tropical forest census data set. Both the simulations and the applications to real data show that VAW-UniFrac can satisfactorily measure differences between communities, taking into account not only species composition but also abundance information.
Conclusions: VAW-UniFrac can recover biological insights that cannot be revealed by other beta diversity measures, and it provides a novel alternative for comparing communities.
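The abstract does not state the formulas, so the following is only a minimal sketch of how the two statistics might be computed. It assumes the commonly cited forms: W-UniFrac as the sum over branches of b_i |A_i/A_T - B_i/B_T|, and VAW-UniFrac as the same sum with each term divided by sqrt(m_i (m - m_i)), where m_i = A_i + B_i and m = A_T + B_T. The function names and the `(length, count_a, count_b)` branch representation are hypothetical, and the significance test for the statistic is not shown.

```python
import math


def weighted_unifrac(branches, total_a, total_b):
    """Unnormalized W-UniFrac (assumed form).

    branches: iterable of (length, count_a, count_b), where the counts are
    the numbers of individuals from community A and B descending from the branch.
    """
    return sum(
        length * abs(a / total_a - b / total_b)
        for length, a, b in branches
    )


def vaw_unifrac(branches, total_a, total_b):
    """Variance adjusted W-UniFrac (assumed form, unnormalized)."""
    m = total_a + total_b
    score = 0.0
    for length, a, b in branches:
        m_i = a + b
        var_term = m_i * (m - m_i)
        if var_term == 0:
            # The branch leads to all or none of the individuals, so the
            # proportion difference is zero and the term contributes nothing.
            continue
        score += length * abs(a / total_a - b / total_b) / math.sqrt(var_term)
    return score


# Toy tree: two branches private to one community, one shared by everyone.
toy_branches = [(1.0, 2, 0), (1.0, 0, 2), (0.5, 2, 2)]
w_score = weighted_unifrac(toy_branches, total_a=2, total_b=2)
vaw_score = vaw_unifrac(toy_branches, total_a=2, total_b=2)
```

In this sketch the adjustment down-weights branches whose pooled counts make the proportion difference more variable under random sampling, which is the motivation described in the abstract.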
In this paper we propose a statistical framework, based on a shape-invariant model together with a false discovery rate (FDR) procedure, for identifying periodically expressed genes from microarray time-course gene expression data and a set of known periodically expressed guide genes. We applied the proposed methods to the alpha-factor, cdc15 and cdc28 synchronized yeast cell cycle data sets and identified a total of 1010 cell-cycle-regulated genes at an FDR of 0.5% in at least one of the three data sets analyzed, including 89 (86%) of 104 known periodic transcripts. We also identified 344 and 201 circadian rhythmic genes in vivo in mouse heart and liver tissues at FDRs of 10% and 2.5%, respectively. Our results also indicate that the shape-invariant model fits the data well and provides estimates of the common shape function and the relative phases for these periodically regulated genes.
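The abstract does not say which FDR procedure is used. As an illustration of FDR control in general, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, which is a stand-in and may differ from the authors' method. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Benjamini-Hochberg step-up: find the largest rank k such that the
    k-th smallest p-value is <= alpha * k / m, then reject the k smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical per-gene p-values for a periodicity test.
calls = benjamini_hochberg([0.001, 0.03, 0.035, 0.5], alpha=0.05)
```

Note the step-up behaviour: the second smallest p-value (0.03) fails its own threshold (0.025) but is still rejected, because the third smallest (0.035) passes its threshold (0.0375).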
Our empirical simulation studies showed that the procedure can indeed recover the true functional forms of the covariates and can identify important variables that are related to the risk of an event. Results from predicting survival after chemotherapy for patients with diffuse large B-cell lymphoma demonstrate that the proposed method can be used for identifying important genes that are related to time to death due to cancer and for building a parsimonious model for predicting the survival of future patients. In addition, there is clear evidence of non-linear effects of some genes on survival time.
In functional genomics, one important problem is to relate microarray gene expression profiles to various clinical phenotypes from patients. Success has been demonstrated in the molecular classification of cancer, in which gene expression data serve as predictors and the type of cancer is the binary or multi-categorical outcome variable. However, there has been less research on linking gene expression profiles to other types of phenotypes, in particular censored survival data such as patients' overall survival or cancer relapse times. In this paper, we develop a kernel Cox regression model for relating gene expression profiles to censored phenotypes in the framework of the penalization method for function estimation in reproducing kernel Hilbert spaces. To circumvent the problem of censoring, we use the negative partial likelihood as the loss function in the estimation procedure. The functional combinations of the original gene expression data identified by the method are highly correlated with the patients' survival times and at the same time account for the variability in the gene expression levels. We apply our method to data sets from diffuse large B-cell lymphoma, lung adenocarcinoma and breast carcinoma studies to verify its effectiveness. The results from these analyses indicate that the proposed method works very well in identifying subgroups of patients with different risks of death or relapse and in predicting the risk of relapse or death based on the gene expression profiles measured from the tumor samples taken from the patients.
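To make the loss function concrete, the following is a minimal sketch of the negative log partial likelihood of the Cox model, written for arbitrary risk scores f(x_i). It assumes no tied event times and omits both the kernel expansion of f and the penalty term, so it is not the authors' full estimation procedure. The function name and the toy data are hypothetical.

```python
import math


def neg_log_partial_likelihood(risk_scores, times, events):
    """Negative log Cox partial likelihood (no special handling of ties).

    risk_scores: f(x_i) for each patient
    times:       observed follow-up times
    events:      1/True if the event was observed, 0/False if censored
    """
    total = 0.0
    for f_i, t_i, d_i in zip(risk_scores, times, events):
        if not d_i:
            continue  # censored patients enter only through the risk sets
        # Risk set: everyone still under observation at time t_i.
        denom = sum(
            math.exp(f_j)
            for f_j, t_j in zip(risk_scores, times)
            if t_j >= t_i
        )
        total -= f_i - math.log(denom)
    return total


# Three patients with equal risk scores; the second is censored.
loss = neg_log_partial_likelihood([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [1, 0, 1])
```

Censored patients never contribute an event term, yet they remain in the risk set of every earlier event, which is how the partial likelihood uses incomplete follow-up without needing the unobserved event time.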