Whole-genome sequencing (WGS) has facilitated the first genome-wide evaluations of the contribution of de novo noncoding mutations to complex disorders. Using WGS, we assess genetic variation from 7,608 samples in 1,902 autism spectrum disorder (ASD) families, identifying 255,106 de novo mutations. In contrast to coding mutations, no noncoding functional annotation category, analyzed in isolation, is significantly associated with ASD. Casting noncoding variation in the context of a de novo risk score across multiple annotation categories, however, does demonstrate association with mutations localized to promoter regions. The strongest driver of this promoter signal emanates from evolutionarily conserved transcription factor binding sites distal to the transcription start site. These data suggest that de novo mutations in promoter regions, characterized by evolutionary and functional signatures, contribute to ASD.
Changepoint detection methods are used in many areas of science and engineering, for example, in the analysis of copy number variation data to detect abnormalities in copy numbers along the genome. Despite the broad array of available tools, methodology for quantifying our uncertainty in the strength (or the presence) of given changepoints post-selection is lacking. Post-selection inference offers a framework to fill this gap, but the most straightforward application of these methods results in low-powered hypothesis tests and leaves open several important questions about practical usability. In this work, we carefully tailor post-selection inference methods toward changepoint detection, focusing on copy number variation data. To accomplish this, we study commonly used changepoint algorithms: binary segmentation, as well as two of its most popular variants, wild and circular, and the fused lasso. We implement some of the latest developments in post-selection inference theory, mainly auxiliary randomization. This improves power, but it requires implementing Markov chain Monte Carlo algorithms (importance sampling and hit-and-run sampling) to carry out our tests. We also provide recommendations for improving practical usability, detailed simulations, and example analyses on array comparative genomic hybridization as well as sequencing data.
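As a rough illustration of the binary segmentation algorithm named in the abstract above, the following is a minimal sketch for detecting shifts in the mean of a one-dimensional signal. It is not the paper's post-selection inference procedure: the function names `cusum` and `binary_segmentation`, the mean-shift model, and the fixed `threshold` stopping rule are all assumptions made for illustration.

```python
import numpy as np


def cusum(y, s, e):
    """Absolute CUSUM statistic for each split point of y[s:e] (e exclusive).

    Entry k-1 compares the mean of the first k points with the mean of
    the remaining n-k points, scaled by sqrt(k * (n - k) / n).
    """
    seg = y[s:e]
    n = len(seg)
    k = np.arange(1, n)                # sizes of the left part
    left_sums = np.cumsum(seg)[:-1]    # sums of the first k points
    total = seg.sum()
    diff = left_sums / k - (total - left_sums) / (n - k)
    return np.abs(np.sqrt(k * (n - k) / n) * diff)


def binary_segmentation(y, threshold):
    """Return sorted changepoint indices b, meaning the mean changes
    between y[b-1] and y[b]. `threshold` is a hypothetical tuning
    parameter: a segment is split only if its largest CUSUM exceeds it.
    """
    y = np.asarray(y, dtype=float)
    found = []
    stack = [(0, len(y))]              # segments still to be examined
    while stack:
        s, e = stack.pop()
        if e - s < 2:                  # nothing to split
            continue
        stats = cusum(y, s, e)
        k = int(np.argmax(stats))
        if stats[k] > threshold:
            b = s + k + 1              # convert to an index into y
            found.append(b)
            stack.append((s, b))       # recurse on both halves
            stack.append((b, e))
    return sorted(found)


if __name__ == "__main__":
    signal = np.concatenate([np.zeros(50), np.full(50, 5.0)])
    print(binary_segmentation(signal, threshold=10.0))
```

In practice the threshold would be tied to the noise level; the value used in the usage example is arbitrary. The wild and circular variants mentioned in the abstract change how candidate segments are chosen and are not covered by this sketch.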
Fibrinogen-related domains (FReDs) are found in a variety of animal proteins with widely different functions, ranging from non-self recognition to clot formation. All appear to have a common surface where binding of one sort or other occurs. An examination of 19 completed animal genomes (including a sponge and sea anemone, six protostomes, and 11 deuterostomes) has allowed phylogenies to be constructed that show where various types of FReP (proteins containing FReDs) first made their appearance. Comparisons of sequences and structures also reveal particular features that correlate with function, including the influence of neighbor-domains. A particular set of insertions in the carboxyl-terminal subdomain was involved in the transition from structures known to bind sugars to those known to bind amino-terminal peptides. Perhaps not unexpectedly, FReDs with different functions have changed at different rates, with ficolins by far the fastest changing group. Significantly, the greatest amount of change in ficolin FReDs occurs in the third subdomain ("P domain"), the very opposite of the situation in most other vertebrate FReDs. The unbalanced style of change was also observed in FReDs from nonchordates, many of which have been implicated in innate immunity.
Detecting when the underlying distribution changes from the observed time series is a fundamental problem arising in a broad spectrum of applications. Change point localization is particularly challenging when we only observe low-dimensional projections of high-dimensional random variables. Specifically, we assume we observe $\{x_t, y_t\}_{t=1}^{n}$, where $\{x_t\}_{t=1}^{n}$ are $p$-dimensional covariates, $\{y_t\}_{t=1}^{n}$ are the univariate responses satisfying $E(y_t) = x_t^\top \beta_t^*$ for all $1 \le t \le n$, and $\{\beta_t^*\}_{t=1}^{n}$ are the unobserved regression parameters that change over time in a piecewise constant manner. We first propose a novel algorithm called Binary Segmentation through Estimated CUSUM statistics (BSE), which computes the change points through direct estimates of the CUSUM statistics of $\{\beta_t^*\}_{t=1}^{n}$. We show that BSE can consistently estimate the unknown location of the change points, achieving error bounds of order $O(\log(p)/n)$. To the best of our knowledge, this is a significant improvement, as the state-of-the-art methods are only shown to achieve error bounds of order $O(\log(p)/\sqrt{n})$ in the multiple change point setting. However, BSE can be computationally costly when the number of change points is large. To overcome this limitation, we introduce another new algorithm called Binary Segmentation through Lasso Estimators (BSLE). We show that BSLE can consistently localize change points with a slightly worse localization error rate compared to BSE, but BSLE is much more computationally efficient. Finally, we leverage the insights gained from BSE and BSLE to develop a novel "local screening" algorithm that can input a coarse estimate of change point locations together with the observed data and efficiently refine that estimate, allowing us to improve the practical performance of past estimators based on group lasso. All of our newly proposed algorithms have good performance in our simulated experiments, especially when the size of changes in the regression parameters $\{\beta_t^*\}_{t=1}^{n}$ is small.
Scientists often embed cells into a lower-dimensional space when studying single-cell RNA-seq data for improved downstream analyses such as developmental trajectory analyses, but the statistical properties of such non-linear embedding methods are often not well understood. In this article, we develop the eSVD (exponential-family SVD), a non-linear embedding method for both cells and genes jointly with respect to a random dot product model using exponential-family distributions. Our estimator uses alternating minimization, which enables us to have a computationally-efficient method, prove the identifiability conditions and consistency of our method, and provide statistically-principled procedures to tune our method. All these qualities help advance the single-cell embedding literature, and we provide extensive simulations to demonstrate that the eSVD is competitive compared to other embedding methods. We apply the eSVD via Gaussian distributions where the standard deviations are proportional to the means to analyze a single-cell dataset of oligodendrocytes in mouse brains (Marques et al., 2016). Using the eSVD estimated embedding, we then investigate the cell developmental trajectories of the oligodendrocytes. While previous results are not able to distinguish the trajectories among the mature oligodendrocyte cell types, our diagnostics and results demonstrate there are two major developmental trajectories that diverge at mature oligodendrocytes.
We develop tail probability bounds for matrix linear combinations with matrix-valued coefficients and matrix-valued quadratic forms. These results extend well-known scalar case results such as the Hanson-Wright inequality, and matrix concentration inequalities such as the matrix Bernstein inequality. A key intermediate result is a deviation bound for matrix-valued U-statistics of order two and their independent sums. As an application of these probability tools in statistical inference, we establish the consistency of a novel bias-adjusted spectral clustering method in multi-layer stochastic block models with general signal structures.
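For orientation, one commonly cited form of the matrix Bernstein inequality mentioned in the abstract above is given below. This is the standard bounded, self-adjoint version, not a result taken from the paper itself; the symbols $X_k$, $d$, $R$, and $\sigma^2$ are notation introduced here for illustration.

```latex
% Assumptions (standard statement, not from the abstract):
% X_1, ..., X_m are independent, self-adjoint d x d random matrices with
% E[X_k] = 0 and ||X_k|| <= R almost surely (operator norm).
\[
  \Pr\!\left( \lambda_{\max}\!\Big( \sum_{k=1}^{m} X_k \Big) \ge t \right)
  \;\le\;
  d \cdot \exp\!\left( \frac{-t^{2}/2}{\sigma^{2} + R t / 3} \right),
  \qquad
  \sigma^{2} = \Big\| \sum_{k=1}^{m} \mathbb{E}\big[ X_k^{2} \big] \Big\| ,
\]
% valid for all t >= 0.
```

The bound has the same sub-Gaussian and sub-exponential regimes as the scalar Bernstein inequality, with the ambient dimension $d$ appearing as a multiplicative factor.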