Genomic association studies of common or rare protein-coding variation have established robust statistical approaches to account for multiple testing. Here, we present a comparable framework to evaluate rare and de novo noncoding single nucleotide variants, insertion/deletions, and all classes of structural variation from whole-genome sequencing (WGS). Integrating genomic annotations at the level of nucleotides, genes, and regulatory regions, we define 51,801 annotation categories. Analyses of 519 autism spectrum disorder families did not identify association with any categories after correction for 4,123 effective tests. Without appropriate correction, biologically plausible associations are observed in both cases and controls. Despite excluding previously identified gene-disrupting mutations, coding regions still exhibited the strongest associations. Thus, in autism the contribution of de novo noncoding variation is probably modest compared to de novo coding variants. Robust results from future WGS studies will require large cohorts and comprehensive analytical strategies that consider the substantial multiple testing burden.
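The multiple-testing burden described above can be made concrete with a short calculation. The sketch below applies a Bonferroni-style correction to the 4,123 effective tests reported in the abstract. The nominal level of 0.05 and the independence assumption in the uncorrected calculation are illustrative assumptions, not details taken from the paper, and the paper's own correction procedure may differ.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


def fwer_uncorrected(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive if every test used alpha
    and the tests were independent (an idealized assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


ALPHA = 0.05             # assumed nominal level, not stated in the abstract
EFFECTIVE_TESTS = 4123   # effective number of tests reported in the abstract

threshold = bonferroni_threshold(ALPHA, EFFECTIVE_TESTS)
fwer = fwer_uncorrected(ALPHA, EFFECTIVE_TESTS)

print(f"per-test threshold: {threshold:.3e}")
print(f"uncorrected FWER:   {fwer:.6f}")
```

Under these assumptions a category would need a p-value of roughly 1.2e-5 or smaller to survive correction, while testing every category at 0.05 would make at least one false positive nearly certain.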
Whole-genome sequencing (WGS) has facilitated the first genome-wide evaluations of the contribution of de novo noncoding mutations to complex disorders. Using WGS, we assess genetic variation from 7,608 samples in 1,902 autism spectrum disorder (ASD) families, identifying 255,106 de novo mutations. In contrast to coding mutations, no noncoding functional annotation category, analyzed in isolation, is significantly associated with ASD. Casting noncoding variation in the context of a de novo risk score across multiple annotation categories, however, does demonstrate association with mutations localized to promoter regions. The strongest driver of this promoter signal emanates from evolutionarily conserved transcription factor binding sites distal to the transcription start site. These data suggest that de novo mutations in promoter regions, characterized by evolutionary and functional signatures, contribute to ASD.
Reliable uncertainty estimation for time series prediction is critical in many fields, including physics, biology, and manufacturing. At Uber, probabilistic time series forecasting is used for robust prediction of the number of trips during special events, driver incentive allocation, and real-time anomaly detection across millions of metrics. Classical time series models are often used in conjunction with a probabilistic formulation for uncertainty estimation. However, such models are hard to tune and scale, and it is difficult to add exogenous variables to them. Motivated by the recent resurgence of Long Short-Term Memory networks, we propose a novel end-to-end Bayesian deep model that provides time series prediction along with uncertainty estimation. We provide detailed experiments of the proposed solution on completed trips data, and successfully apply it to large-scale time series anomaly detection at Uber.
Comment: To appear in DSBDA-2017 @ ICDM'1
In this paper, which is the second in a series of two, the preasymptotic error analysis of the continuous interior penalty finite element method (CIP-FEM) and the FEM for the Helmholtz equation in two and three dimensions is continued. While Part I contained results on the linear CIP-FEM and FEM, the present part deals with approximation spaces of order p ≥ 1. By using a modified duality argument, preasymptotic error estimates are derived for both methods under the condition of kh […], where k is the wave number, h is the mesh size, and C_0 is a constant independent of k, h, p, and the penalty parameters. It is shown that the pollution errors of both methods in […] if the exact solution u ∈ H^2(Ω), which coincide with existing dispersion analyses for the FEM on Cartesian grids. Here σ is a constant independent of k, h, p, and the penalty parameters. Moreover, it is proved that the CIP-FEM is stable for any k, h, p > 0 and penalty parameters with positive imaginary parts. Besides the advantage of the absolute stability of the CIP-FEM compared to the FEM, the penalty parameters may be tuned to reduce the pollution effects.
Key words: Helmholtz equation, large wave number, preasymptotic error estimates, continuous interior penalty finite element methods, finite element methods
Recent advances in technology have enabled the measurement of RNA levels for individual cells. Compared to traditional tissue-level bulk RNA-seq data, single cell sequencing yields valuable insights about gene expression profiles for different cell types, which is potentially critical for understanding many complex human diseases. However, developing quantitative tools for such data remains challenging because of high levels of technical noise, especially the “dropout” events. A “dropout” happens when the RNA for a gene fails to be amplified prior to sequencing, producing a “false” zero in the observed data. In this paper, we propose a Unified RNA-Sequencing Model (URSM) for both single cell and bulk RNA-seq data, formulated as a hierarchical model. URSM borrows the strength from both data sources and carefully models the dropouts in single cell data, leading to a more accurate estimation of cell type specific gene expression profile. In addition, URSM naturally provides inference on the dropout entries in single cell data that need to be imputed for downstream analyses, as well as the mixing proportions of different cell types in bulk samples. We adopt an empirical Bayes’ approach, where parameters are estimated using the EM algorithm and approximate inference is obtained by Gibbs sampling. Simulation results illustrate that URSM outperforms existing approaches both in correcting for dropouts in single cell data, as well as in deconvolving bulk samples. We also demonstrate an application to gene expression data on fetal brains, where our model successfully imputes the dropout genes and reveals cell type specific expression patterns.
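To illustrate what a dropout event does to single cell counts, here is a minimal simulation sketch. It is not the URSM model itself: the Poisson counts, the logistic link in `dropout_probability`, its parameters `kappa` and `tau`, and the per-gene rates are all hypothetical choices made only to show how "false" zeros arise on top of true zeros.

```python
import math
import random


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Draw a Poisson count with Knuth's method (adequate for small rates)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def dropout_probability(expression: float, kappa: float = 0.0, tau: float = 1.0) -> float:
    """Hypothetical logistic link: higher expression gives a lower dropout chance."""
    return 1.0 / (1.0 + math.exp(-(kappa - tau * expression)))


def simulate_cell(rng: random.Random, rates):
    """Return (true counts, observed counts) for one cell; dropouts become zeros."""
    true_counts = [sample_poisson(rng, lam) for lam in rates]
    observed = []
    for count in true_counts:
        dropped = rng.random() < dropout_probability(math.log1p(count))
        observed.append(0 if dropped else count)
    return true_counts, observed


rng = random.Random(0)
rates = [0.5, 2.0, 5.0, 10.0]  # made-up per-gene mean expression levels
true_counts, observed = simulate_cell(rng, rates)

true_zeros = sum(c == 0 for c in true_counts)
observed_zeros = sum(c == 0 for c in observed)
print("true counts:    ", true_counts)
print("observed counts:", observed)
print("zeros (true vs observed):", true_zeros, observed_zeros)
```

Any observed zero whose true count is positive is a dropout in this toy setting. Distinguishing those entries from genuine zeros is the imputation problem the abstract refers to.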
Significance: Growth typically involves differentiation of cells from progenitors into more specialized descendants, often involving lineages of pure and transitional cells to achieve final form. Recent technology has enabled estimation of gene expression profiles of single cells and these profiles theoretically differentiate pure cell types. What is missing from the analytical toolbox is an efficient technique to classify pure and transitional cells from their profiles. Here we propose semisoft clustering with pure cells (SOUP). This algorithm performs well in the hard-clustering problem for pure cell types and excels at identifying transitional cells with soft memberships. Moreover, SOUP provides an estimate of the developmental trajectories based on the estimated cell type membership that naturally adapts to cells in transition.
We propose and analyze a generic method for community recovery in stochastic block models and degree corrected block models. This approach can exactly recover the hidden communities with high probability when the expected node degrees are of order log n or higher. Starting from a roughly correct community partition given by some conventional community recovery algorithm, this method refines the partition in a cross clustering step. Our results simplify and extend some of the previous work on exact community recovery, discovering the key role played by sample splitting. The proposed method is simple and can be implemented with many practical community recovery algorithms.
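The idea of refining a roughly correct partition can be sketched in a few lines. The function below is a simplified, hypothetical refinement step: each node is reassigned to the community whose current members it connects to most densely. It deliberately omits the sample splitting that the abstract identifies as the key ingredient, so it should be read as an illustration of the refinement idea and not as the proposed method.

```python
def refine_labels(adj, labels, n_communities):
    """Reassign each node to the community it is most densely connected to.

    adj is a symmetric 0/1 adjacency matrix (list of lists) and labels is an
    initial, possibly noisy, community assignment. All decisions use the
    initial labels, so the result does not depend on node order.
    """
    n = len(adj)
    refined = []
    for v in range(n):
        best, best_density = labels[v], -1.0
        for c in range(n_communities):
            members = [u for u in range(n) if u != v and labels[u] == c]
            if not members:
                continue
            density = sum(adj[v][u] for u in members) / len(members)
            if density > best_density:
                best, best_density = c, density
        refined.append(best)
    return refined


# Toy graph: two disjoint cliques, nodes 0-3 and nodes 4-7.
n = 8
truth = [0, 0, 0, 0, 1, 1, 1, 1]
adj = [[1 if u != v and truth[u] == truth[v] else 0 for u in range(n)]
       for v in range(n)]

initial = [0, 0, 0, 1, 1, 1, 1, 1]  # node 3 starts in the wrong community
refined = refine_labels(adj, initial, n_communities=2)
print(refined)
```

In this noiseless toy graph a single pass moves node 3 back to its own clique. In a random block model, reusing the same edges for the initial partition and for the refinement creates dependence, which is the problem sample splitting is meant to address.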
Transcription at enhancers is a widespread phenomenon that produces so-called enhancer RNA (eRNA) and occurs in an activity-dependent manner. However, the role of eRNA and its utility in exploring disease-associated changes in enhancer function, and the downstream coding transcripts that they regulate, are not well established. We used transcriptomic and epigenomic data to interrogate the relationship of eRNA transcription to disease status and how genetic variants alter enhancer transcriptional activity in the human brain. We combined RNA-seq data from 537 postmortem brain samples from the CommonMind Consortium with cap analysis of gene expression and enhancer identification, using the assay for transposase-accessible chromatin followed by sequencing (ATACseq). We find 118 differentially transcribed eRNAs in schizophrenia and identify schizophrenia-associated gene/eRNA co-expression modules. Perturbations of a key module are associated with the polygenic risk scores. Furthermore, we identify genetic variants affecting expression of 927 enhancers, which we refer to as enhancer expression quantitative loci or eeQTLs. Enhancer expression patterns are consistent across studies, including differentially expressed eRNAs and eeQTLs. Combining eeQTLs with a genome-wide association study of schizophrenia identifies a genetic variant that alters enhancer function and expression of its target gene, GOLPH3L. Our novel approach to analyzing enhancer transcription is adaptable to other large-scale, non-poly-A depleted, RNA-seq studies.