The C. elegans genome has been completely sequenced, and the developmental anatomy of this model organism is described at single-cell resolution. Here we utilize strategies that exploit this precisely defined architecture to link gene expression to cell type. We obtained RNAs from specific cells and from each developmental stage using tissue-specific promoters to mark cells for isolation by FACS or for mRNA extraction by the mRNA-tagging method. We then generated gene expression profiles of more than 30 different cells and developmental stages using tiling arrays. Machine-learning–based analysis detected transcripts corresponding to established gene models and revealed novel transcriptionally active regions (TARs) in noncoding domains that comprise at least 10% of the total C. elegans genome. Our results show that about 75% of transcripts with detectable expression are differentially expressed among developmental stages and across cell types. Examination of known tissue- and cell-specific transcripts validates these data sets and suggests that newly identified TARs may exercise cell-specific functions. Additionally, we used self-organizing maps to define groups of coregulated transcripts and applied regulatory element analysis to identify known transcription factor– and miRNA-binding sites, as well as novel motifs that likely function to control subsets of these genes. By using cell-specific, whole-genome profiling strategies, we have detected a large number of novel transcripts and produced high-resolution gene expression maps that provide a basis for establishing the roles of individual genes in cellular differentiation.
SUMMARY: The responses of plants to abiotic stresses are accompanied by massive changes in transcriptome composition. To provide a comprehensive view of stress-induced changes in the Arabidopsis thaliana transcriptome, we have used whole-genome tiling arrays to analyze the effects of salt, osmotic, cold and heat stress as well as application of the hormone abscisic acid (ABA), an important mediator of stress responses. Among annotated genes in the reference strain Columbia we have found many stress-responsive genes, including several transcription factor genes as well as pseudogenes and transposons that have been missed in previous analyses with standard expression arrays. In addition, we report hundreds of newly identified, stress-induced transcribed regions. These often overlap with known, annotated genes. The results are accessible through the Arabidopsis thaliana Tiling Array Express (At-TAX) homepage, which provides convenient tools for displaying expression values of annotated genes, as well as visualization of unannotated transcribed regions along each chromosome.
The spindle assembly checkpoint is a conserved signalling pathway that protects genome integrity. Given its central importance, this checkpoint should withstand stochastic fluctuations and environmental perturbations, but the extent of and mechanisms underlying its robustness remain unknown. We probed spindle assembly checkpoint signalling by modulating checkpoint protein abundance and nutrient conditions in fission yeast. For core checkpoint proteins, a mere 20% reduction can suffice to impair signalling, revealing a surprising fragility. Quantification of protein abundance in single cells showed little variability (noise) of critical proteins, explaining why the checkpoint normally functions reliably. Checkpoint-mediated stoichiometric inhibition of the anaphase activator Cdc20 (Slp1 in Schizosaccharomyces pombe) can account for the tolerance towards small fluctuations in protein abundance and explains our observation that some perturbations lead to non-genetic variation in the checkpoint response. Our work highlights low gene expression noise as an important determinant of reliable checkpoint signalling.
We present SplashRNA, a sequential classifier to predict potent microRNA-based short hairpin RNAs (shRNAs). Trained on published and novel datasets, SplashRNA outperforms previous algorithms and reliably predicts the most efficient shRNAs for a given gene. Combined with an optimized miR-E backbone, >90% of high-scoring SplashRNA predictions trigger >85% protein knockdown when expressed from a single genomic integration. SplashRNA can significantly improve the accuracy of loss-of-function genetics studies and facilitates the generation of compact shRNA libraries.
The linear mixed model (LMM) is now routinely used to estimate heritability. Unfortunately, as we demonstrate, LMM estimates of heritability can be inflated when using a standard model. To help reduce this inflation, we used a more general LMM with two random effects: one based on genomic variants and one based on easily measured spatial location as a proxy for environmental effects. We investigated this approach with simulated data and with data from a Uganda cohort of 4,778 individuals for 34 phenotypes including anthropometric indices, blood factors, glycemic control, blood pressure, lipid tests, and liver function tests. For the genomic random effect, we used identity-by-descent estimates from accurately phased genomewide data. For the environmental random effect, we constructed a covariance matrix based on a Gaussian radial basis function. Across the simulated and Ugandan data, narrow-sense heritability estimates were lower using the more general model. Thus, our approach addresses, in part, the issue of "missing heritability" in the sense that much of the heritability previously thought to be missing was fictional. Software is available at https://github.com/MicrosoftGenomics/FaST-LMM.
An important causal question comes from the age-old debate about nature versus nurture. For any phenotype such as height or intelligence quotient, how much of the phenotype is inherited and how much is determined by environment? This question was made precise by Fisher (1) and Wright (2) almost a century ago: Given observations of a phenotype from a population of individuals, what is the fraction of variance of the phenotype that is caused by inherited factors relative to the total variance of the phenotype due to both inherited and environmental factors? This fraction, termed "heritability," has been the subject of intense study across various phenotypes and populations since it was defined.
Note that, in contrast to how some interpret the informal question around the nature-versus-nurture debate, heritability is not an absolute quantity but rather a quantity relative to a given population. For example, a phenotype in a population where environmental factors have large variation will have a smaller heritability than in an otherwise similar population where environmental factors have a small variation.
Over the years, many approaches have been developed to estimate heritability from data (3, 4). Here, we concentrate on an approach made possible by the recent ability to sequence genomes at a modest cost (5, 6). The approach uses a linear mixed model (LMM), a form of multivariate regression of the genomic and environmental factors on the phenotype, which we examine in detail in the next section.
In the standard LMM approach, the effects of environmental factors on the phenotype are modeled as noise. Specifically, the phenotype of each individual is assumed to be the sum of two random effects, one based on genomic factors and one based on environmental factors, where the latter is assumed to be mutually independent across indivi...
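As a rough illustration of the two ideas in this abstract, the sketch below builds a Gaussian radial basis function covariance matrix from spatial coordinates and computes heritability as a fraction of total variance. It is a minimal sketch under stated assumptions, not the FaST-LMM implementation: the function names, the `length_scale` parameter, and the three-way split of variance into genomic, spatial, and residual components are all hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_rbf_covariance(coords, length_scale):
    """Covariance between individuals from their spatial locations.

    coords: (n, d) array of locations (assumed layout, e.g. d = 2).
    length_scale: hypothetical kernel width; larger values make
    distant individuals more similar.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise squared Euclidean distances via broadcasting.
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))


def heritability_fraction(var_genomic, var_spatial, var_residual):
    """Genomic variance divided by total phenotypic variance.

    Treating total variance as the sum of these three components
    is an assumption of this sketch.
    """
    total = var_genomic + var_spatial + var_residual
    return var_genomic / total
```

In this toy setup, a model that omits the spatial component would have to absorb that variance elsewhere, which is the kind of situation in which the abstract reports inflated estimates.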
Gene expression maps for model organisms, including Arabidopsis thaliana, have typically been created using gene-centric expression arrays. Here, we describe a comprehensive expression atlas, Arabidopsis thaliana Tiling Array Express (At-TAX), which is based on whole-genome tiling arrays. We demonstrate that tiling arrays are accurate tools for gene expression analysis and identified more than 1,000 unannotated transcribed regions. Visualizations of gene expression estimates, transcribed regions, and tiling probe measurements are accessible online at the At-TAX homepage.
Motivation: Set-based variance component tests have been identified as a way to increase power in association studies by aggregating weak individual effects. However, the choice of test statistic has been largely ignored even though it may play an important role in obtaining optimal power. We compared a standard statistical test, a score test, with a recently developed likelihood ratio (LR) test. Further, when correction for hidden structure is needed, or gene–gene interactions are sought, state-of-the-art algorithms for both the score and LR tests can be computationally impractical. Thus we develop new computationally efficient methods.
Results: After reviewing theoretical differences in performance between the score and LR tests, we find empirically on real data that the LR test generally has more power. In particular, on 15 of 17 real datasets, the LR test yielded at least as many associations as the score test (up to 23 more associations), whereas the score test yielded at most one more association than the LR test in the two remaining datasets. On synthetic data, we find that the LR test yielded up to 12% more associations, consistent with our results on real data, but also observe a regime of extremely small signal where the score test yielded up to 25% more associations than the LR test, consistent with theory. Finally, our computational speedups now enable (i) efficient LR testing when the background kernel is full rank, and (ii) efficient score testing when the background kernel changes with each test, as for gene–gene interaction tests. The latter yielded a factor of 2000 speedup on a cohort of size 13 500.
Availability: Software available at http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/.
Contact: heckerma@microsoft.com
Supplementary information: Supplementary data are available at Bioinformatics online.
We examine improvements to the linear mixed model (LMM) that better correct for population structure and family relatedness in genome-wide association studies (GWAS). LMMs rely on the estimation of a genetic similarity matrix (GSM), which encodes the pairwise similarity between every two individuals in a cohort. These similarities are estimated from single nucleotide polymorphisms (SNPs) or other genetic variants. Traditionally, all available SNPs are used to estimate the GSM. In empirical studies across a wide range of synthetic and real data, we find that modifications to this approach improve GWAS performance as measured by type I error control and power. Specifically, when only population structure is present, a GSM constructed from SNPs that well predict the phenotype, in combination with principal components as covariates, controls type I error and yields more power than the traditional LMM. In any setting, with or without population structure or family relatedness, a GSM consisting of a mixture of two component GSMs, one constructed from all SNPs and another constructed from SNPs that well predict the phenotype, again controls type I error and yields more power than the traditional LMM. Software implementing these improvements and the experimental comparisons are available at http://microsoft.com/science.
There has been a great deal of interest in statistical methods for genome-wide association studies (GWAS). While linear or logistic regression has been commonly used for this task, the need to move beyond these models has become clear. One important motivation for more sophisticated models is the existence of confounding structure, including population structure and family relatedness. Recently, the linear mixed model (LMM) has emerged as the model of choice to correct for such confounding structure [1-9].
Despite its rapid acceptance, however, there remain concerns about its use, and several improvements have been proposed.
One suggested improvement is the inclusion of principal components (PCs) as covariates to better capture population structure [8]. Another proposed improvement is to use only a subset of single nucleotide polymorphisms (SNPs) for inclusion in the LMM [4-7, 10]. In particular, the LMM relies on an estimate of the genetic similarity matrix (GSM), which encodes the pairwise similarity between every two individuals in the data set. These similarities are estimated from SNPs or other genetic variants. While traditionally all available SNPs are used to estimate the GSM, researchers have considered using a subset, chosen in at least two different ways.
In one approach, SNPs are chosen such that they are roughly equally spaced across the genome [4]. The idea behind this approach is that linkage disequilibrium (LD) among the SNPs mitigates the need to use all of them. One motivation underlying this approach is computational efficiency. Namely, when the number of selected SNPs is less than the sample size of the data, then the computation of P values becomes linear in sample size, rather than qu...
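To make the GSM idea above concrete, the following sketch shows one common way to estimate a genetic similarity matrix from a genotype matrix, and a convex mixture of two such matrices. This is a minimal sketch under stated assumptions: the standardize-then-cross-product construction, the function names, and the `weight` parameter are illustrative and are not taken from the authors' software.

```python
import numpy as np


def genetic_similarity_matrix(genotypes):
    """Estimate pairwise similarity from an (individuals x SNPs) matrix.

    genotypes: minor-allele counts (0, 1, 2), an assumed encoding.
    Each SNP is standardized to zero mean and unit variance, then
    similarity is the average product over SNPs.
    """
    G = np.asarray(genotypes, dtype=float)
    mean = G.mean(axis=0)
    std = G.std(axis=0)
    # Drop monomorphic SNPs, which carry no similarity information.
    keep = std > 0
    Z = (G[:, keep] - mean[keep]) / std[keep]
    return Z @ Z.T / Z.shape[1]


def mixture_gsm(gsm_all, gsm_selected, weight):
    """Convex combination of an all-SNP GSM and a selected-SNP GSM.

    weight: hypothetical mixing parameter in [0, 1].
    """
    return (1.0 - weight) * gsm_all + weight * gsm_selected
```

In practice the mixing weight and the set of selected SNPs would need to be chosen from data; the sketch leaves both as inputs.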