Genomic association studies of common or rare protein-coding variation have established robust statistical approaches to account for multiple testing. Here, we present a comparable framework to evaluate rare and de novo noncoding single nucleotide variants, insertion/deletions, and all classes of structural variation from whole-genome sequencing (WGS). Integrating genomic annotations at the level of nucleotides, genes, and regulatory regions, we define 51,801 annotation categories. Analyses of 519 autism spectrum disorder families did not identify association with any categories after correction for 4,123 effective tests. Without appropriate correction, biologically plausible associations are observed in both cases and controls. Despite excluding previously identified gene-disrupting mutations, coding regions still exhibited the strongest associations. Thus, in autism the contribution of de novo noncoding variation is probably modest compared to de novo coding variants. Robust results from future WGS studies will require large cohorts and comprehensive analytical strategies that consider the substantial multiple testing burden.
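The multiple-testing step described above can be illustrated with a minimal sketch. It assumes a Bonferroni-style correction over the number of effective tests, which the abstract does not specify; the function names and the p-values are hypothetical and only show how a nominally small p-value can fail a threshold based on 4,123 effective tests.

```python
from typing import Dict, List


def bonferroni_threshold(alpha: float, n_effective_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni-style correction (assumed)."""
    if n_effective_tests < 1:
        raise ValueError("n_effective_tests must be at least 1")
    return alpha / n_effective_tests


def significant_categories(
    p_values: Dict[str, float], alpha: float, n_effective_tests: int
) -> List[str]:
    """Return the names of categories whose p-value passes the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_effective_tests)
    return sorted(name for name, p in p_values.items() if p <= threshold)


if __name__ == "__main__":
    # Hypothetical p-values for three annotation categories (not from the paper).
    example = {"cat_a": 1e-6, "cat_b": 2e-5, "cat_c": 0.03}
    # Uncorrected: every category looks significant at alpha = 0.05.
    print(significant_categories(example, alpha=0.05, n_effective_tests=1))
    # Corrected for 4,123 effective tests: only the smallest p-value survives.
    print(significant_categories(example, alpha=0.05, n_effective_tests=4123))
```

The contrast between the two calls mirrors the abstract's point that, without appropriate correction, plausible-looking associations appear.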
Whole-genome sequencing (WGS) has facilitated the first genome-wide evaluations of the contribution of de novo noncoding mutations to complex disorders. Using WGS, we assess genetic variation from 7,608 samples in 1,902 autism spectrum disorder (ASD) families, identifying 255,106 de novo mutations. In contrast to coding mutations, no noncoding functional annotation category, analyzed in isolation, is significantly associated with ASD. Casting noncoding variation in the context of a de novo risk score across multiple annotation categories, however, does demonstrate association with mutations localized to promoter regions. The strongest driver of this promoter signal emanates from evolutionarily conserved transcription factor binding sites distal to the transcription start site. These data suggest that de novo mutations in promoter regions, characterized by evolutionary and functional signatures, contribute to ASD.
Reliable uncertainty estimation for time series prediction is critical in many fields, including physics, biology, and manufacturing. At Uber, probabilistic time series forecasting is used for robust prediction of number of trips during special events, driver incentive allocation, as well as real-time anomaly detection across millions of metrics. Classical time series models are often used in conjunction with a probabilistic formulation for uncertainty estimation. However, such models are hard to tune, scale, and add exogenous variables to. Motivated by the recent resurgence of Long Short Term Memory networks, we propose a novel end-to-end Bayesian deep model that provides time series prediction along with uncertainty estimation. We provide detailed experiments of the proposed solution on completed trips data, and successfully apply it to large-scale time series anomaly detection at Uber.
Comment: To appear in DSBDA-2017 @ ICDM'1
In this paper, which is the second in a series of two, the preasymptotic error analysis of the continuous interior penalty finite element method (CIP-FEM) and the FEM for the Helmholtz equation in two and three dimensions is continued. While Part I contained results on the linear CIP-FEM and FEM, the present part deals with approximation spaces of order p ≥ 1. By using a modified duality argument, preasymptotic error estimates are derived for both methods under the condition of kh […], where k is the wave number, h is the mesh size, and C₀ is a constant independent of k, h, p, and the penalty parameters. It is shown that the pollution errors of both methods in […] if the exact solution u ∈ H²(Ω), which coincide with existent dispersion analyses for the FEM on Cartesian grids. Here σ is a constant independent of k, h, p, and the penalty parameters. Moreover, it is proved that the CIP-FEM is stable for any k, h, p > 0 and penalty parameters with positive imaginary parts. Besides the advantage of the absolute stability of the CIP-FEM compared to the FEM, the penalty parameters may be tuned to reduce the pollution effects.
Key words. Helmholtz equation, large wave number, preasymptotic error estimates, continuous interior penalty finite element methods, finite element methods
Recent advances in technology have enabled the measurement of RNA levels for individual cells. Compared to traditional tissue-level bulk RNA-seq data, single cell sequencing yields valuable insights about gene expression profiles for different cell types, which is potentially critical for understanding many complex human diseases. However, developing quantitative tools for such data remains challenging because of high levels of technical noise, especially “dropout” events. A “dropout” happens when the RNA for a gene fails to be amplified prior to sequencing, producing a “false” zero in the observed data. In this paper, we propose a Unified RNA-Sequencing Model (URSM) for both single cell and bulk RNA-seq data, formulated as a hierarchical model. URSM borrows strength from both data sources and carefully models the dropouts in single cell data, leading to more accurate estimation of cell type specific gene expression profiles. In addition, URSM naturally provides inference on the dropout entries in single cell data that need to be imputed for downstream analyses, as well as the mixing proportions of different cell types in bulk samples. We adopt an empirical Bayes approach, where parameters are estimated using the EM algorithm and approximate inference is obtained by Gibbs sampling. Simulation results illustrate that URSM outperforms existing approaches both in correcting for dropouts in single cell data and in deconvolving bulk samples. We also demonstrate an application to gene expression data on fetal brains, where our model successfully imputes the dropout genes and reveals cell type specific expression patterns.
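To make the notion of a “false” zero concrete, the following toy simulation shows how dropouts turn nonzero expression into observed zeros. It is only a sketch under a strong assumption: each entry is dropped independently with one fixed probability, which is not taken from the abstract and is not URSM's hierarchical model. All names and counts are hypothetical.

```python
import random
from typing import List


def apply_dropout(
    true_counts: List[int], dropout_prob: float, rng: random.Random
) -> List[int]:
    """Zero out each count independently with probability dropout_prob (assumed model)."""
    observed = []
    for count in true_counts:
        dropped = rng.random() < dropout_prob
        observed.append(0 if dropped else count)
    return observed


def count_false_zeros(true_counts: List[int], observed: List[int]) -> int:
    """Count entries that are zero in the observed data but nonzero in truth."""
    return sum(1 for t, o in zip(true_counts, observed) if t > 0 and o == 0)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical true expression counts for one cell across eight genes.
    truth = [12, 0, 5, 30, 1, 0, 7, 18]
    observed = apply_dropout(truth, dropout_prob=0.3, rng=rng)
    print("observed:", observed)
    print("false zeros:", count_false_zeros(truth, observed))
```

Genes whose true count is already zero are not counted as false zeros, which is the distinction an imputation method has to recover from the data.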
Significance: Growth typically involves differentiation of cells from progenitors into more specialized descendants, often involving lineages of pure and transitional cells to achieve final form. Recent technology has enabled estimation of gene expression profiles of single cells and these profiles theoretically differentiate pure cell types. What is missing from the analytical toolbox is an efficient technique to classify pure and transitional cells from their profiles. Here we propose semisoft clustering with pure cells (SOUP). This algorithm performs well in the hard-clustering problem for pure cell types and excels at identifying transitional cells with soft memberships. Moreover, SOUP provides an estimate of the developmental trajectories based on the estimated cell type membership that naturally adapts to cells in transition.
We propose and analyze a generic method for community recovery in stochastic block models and degree corrected block models. This approach can exactly recover the hidden communities with high probability when the expected node degrees are of order log n or higher. Starting from a roughly correct community partition given by some conventional community recovery algorithm, this method refines the partition in a cross clustering step. Our results simplify and extend some of the previous work on exact community recovery, discovering the key role played by sample splitting. The proposed method is simple and can be implemented with many practical community recovery algorithms.
Scientists routinely compare gene expression levels in cases versus controls in part to determine genes associated with a disease. Similarly, detecting case-control differences in co-expression among genes can be critical to understanding complex human diseases; however, statistical methods have been limited by the high dimensional nature of this problem. In this paper, we construct a sparse-Leading-Eigenvalue-Driven (sLED) test for comparing two high-dimensional covariance matrices. By focusing on the spectrum of the differential matrix, sLED provides a novel perspective that accommodates what we assume to be common, namely sparse and weak signals in gene expression data, and it is closely related to Sparse Principal Component Analysis. We prove that sLED achieves full power asymptotically under mild assumptions, and simulation studies verify that it outperforms other existing procedures under many biologically plausible scenarios. Applying sLED to the largest gene-expression dataset obtained from post-mortem brain tissue from schizophrenia patients and controls, we provide a novel list of genes implicated in schizophrenia and reveal intriguing patterns in gene co-expression change for schizophrenia subjects. We also illustrate that sLED can be generalized to compare other gene-gene “relationship” matrices that are of practical interest, such as the weighted adjacency matrices.
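The idea of testing two covariance matrices through the spectrum of their difference can be sketched as follows. This is a simplified, non-sparse stand-in and not the sLED procedure itself: it uses the ordinary leading eigenvalue of the differential matrix (taking the larger of the leading eigenvalues of D and -D) and calibrates it with a permutation test, both of which are assumptions made here for illustration. The function names are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def leading_eigenvalue_stat(x: np.ndarray, y: np.ndarray) -> float:
    """Largest absolute eigenvalue of D = cov(y) - cov(x); rows are samples."""
    d = np.cov(y, rowvar=False) - np.cov(x, rowvar=False)
    eigvals = np.linalg.eigvalsh(d)  # ascending order for a symmetric matrix
    return float(max(eigvals[-1], -eigvals[0]))


def permutation_pvalue(
    x: np.ndarray, y: np.ndarray, n_perm: int = 200, seed: int = 0
) -> float:
    """Permutation p-value: shuffle group labels and recompute the statistic."""
    rng = np.random.default_rng(seed)
    observed = leading_eigenvalue_stat(x, y)
    pooled = np.vstack([x, y])
    n_x = x.shape[0]
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled.shape[0])
        stat = leading_eigenvalue_stat(pooled[perm[:n_x]], pooled[perm[n_x:]])
        if stat >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    controls = rng.normal(size=(60, 5))  # hypothetical expression data
    cases = rng.normal(size=(60, 5))
    cases[:, 1] = cases[:, 0] + 0.3 * cases[:, 1]  # add co-expression in cases
    print("statistic:", leading_eigenvalue_stat(controls, cases))
    print("p-value:", permutation_pvalue(controls, cases))
```

A sparse variant would replace the plain eigenvalue with a sparsity-constrained one, which is the part this sketch deliberately leaves out.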