Let W be a random variable with mean zero and variance σ².
Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D2 statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D2 statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D2 word count statistic, which we call D2S and D2∗. For D2S, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, D2∗, outperforms D2S in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D2∗, we cannot provide a closed form for power calculations.
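The D2 statistic described above can be illustrated with a minimal sketch. It assumes the usual definition, in which D2 is the sum over all k-tuples w of X_w · Y_w, where X_w and Y_w are the overlapping occurrence counts of w in the two sequences. The function names `kmer_counts` and `d2` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-tuple (k-mer) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """D2 = sum over words w of X_w * Y_w (number of k-tuple matches)."""
    x = kmer_counts(seq_a, k)
    y = kmer_counts(seq_b, k)
    # Only words present in both sequences contribute a nonzero product.
    return sum(x[w] * y[w] for w in x.keys() & y.keys())
```

Because the raw counts are not centered, a word that is frequent in either sequence alone inflates the sum, which is the single-sequence noise the abstract refers to.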
We compute explicit bounds in the normal and chi-square approximations of multilinear homogeneous sums (of arbitrary order) of general centered independent random variables with unit variance. In particular, we show that chaotic random variables enjoy the following form of universality: (a) the normal and chi-square approximations of any homogeneous sum can be completely characterized and assessed by first switching to its Wiener chaos counterpart, and (b) the simple upper bounds and convergence criteria available on the Wiener chaos extend almost verbatim to the class of homogeneous sums. Our findings partially rely on the notion of "low influences" (see again [10]) for real-valued functions defined on product spaces. As indicated by the title, we regard the two properties (a) and (b) as an instance of the universality phenomenon, according to which most information about large random systems (such as the "distance to Gaussian" of nonlinear functionals of large samples of independent random variables) does not depend on the particular distribution of the components. Other recent examples of the universality phenomenon appear in the already quoted paper [10], as well as in the Tao-Vu proof of the circular law for random matrices, as detailed in [31] (see also the Appendix to [31] by Krishnapur). Observe that, in Section 7, we will prove analogous results for the multivariate normal approximation of vectors of homogeneous sums of possibly different orders. In a further work by the first two authors (see [14]) the results of the present paper are applied in order to deduce universal Gaussian fluctuations for traces associated with non-Hermitian matrix ensembles.
In this paper we establish a multivariate exchangeable pairs approach within the framework of Stein's method to assess distributional distances to potentially singular multivariate normal distributions. By extending the statistics into a higher-dimensional space, we also propose an embedding method which allows for a normal approximation even when the corresponding statistics of interest do not lend themselves easily to Stein's exchangeable pairs approach. To illustrate the method, we provide the examples of runs on the line as well as double-indexed permutation statistics. Heuristically, (1.1) can be understood as a linear regression condition. If (W, W′) were bivariate normal with correlation ρ, then
In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein's method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests.
Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2∗, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy-to-use program. The power is assessed under two alternative hidden Markov models; the first one assumes that the two sequences share a common motif, whereas the second model is a pattern transfer model; the null model is that the two sequences are composed of independent and identically distributed letters and they are independent. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2∗, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2∗ generally has the largest power. Under the first alternative model of a shared motif, the power of D2∗ approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2∗ for such practical applications. Under the second alternative model of pattern transfer, the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version).
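The difference between the raw, centralized, and self-standardized statistics can be sketched as follows. The abstract does not state the normalizations, so the forms used here are an assumption, taken from how these statistics are commonly written in the alignment-free literature: with centered counts X̃_w = X_w − n̄·p_w and Ỹ_w = Y_w − m̄·p_w, D2∗ sums X̃_w·Ỹ_w / (√(n̄·m̄)·p_w) and D2S sums X̃_w·Ỹ_w / √(X̃_w² + Ỹ_w²). The names `d2_variants`, `word_prob`, and `letter_probs` are hypothetical, and the i.i.d. letter model matches the null model described above.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def word_prob(word, letter_probs):
    """Probability of a word under an i.i.d. letter model."""
    return math.prod(letter_probs[c] for c in word)


def d2_variants(seq_a, seq_b, k, letter_probs):
    """Return (D2, D2*, D2S) under the assumed normalizations.

    Requires both sequences to have length >= k and all letter
    probabilities to be strictly positive.
    """
    n_bar = len(seq_a) - k + 1  # number of k-tuple positions in seq_a
    m_bar = len(seq_b) - k + 1  # number of k-tuple positions in seq_b
    x = kmer_counts(seq_a, k)
    y = kmer_counts(seq_b, k)
    d2 = d2_star = d2_s = 0.0
    # Centered statistics need every word, including those with zero count.
    for letters in product(sorted(letter_probs), repeat=k):
        w = "".join(letters)
        p_w = word_prob(w, letter_probs)
        x_c = x[w] - n_bar * p_w  # centered count in seq_a
        y_c = y[w] - m_bar * p_w  # centered count in seq_b
        d2 += x[w] * y[w]
        d2_star += x_c * y_c / (math.sqrt(n_bar * m_bar) * p_w)
        norm = math.hypot(x_c, y_c)
        if norm > 0:  # skip words where both centered counts vanish
            d2_s += x_c * y_c / norm
    return d2, d2_star, d2_s
```

Looping over all |alphabet|^k words is only practical for small k; in practice the letter probabilities would also have to be estimated from the data rather than supplied.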
Community detection, the division of a network into dense subnetworks with only sparse connections between them, has been a topic of vigorous study in recent years. However, while there exists a range of powerful and flexible methods for dividing a network into a specified number of communities, it is an open question how to determine exactly how many communities one should use. Here we describe a mathematically principled approach for finding the number of communities in a network using a maximum-likelihood method. We demonstrate the approach on a range of real-world examples with known community structure, finding that it is able to determine the number of communities correctly in every case.
We propose a new general version of Stein's method for univariate distributions. In particular we propose a canonical definition of the Stein operator of a probability distribution which is based on a linear difference or differential-type operator. The resulting Stein identity highlights the unifying theme behind the literature on Stein's method (both for continuous and discrete distributions). Viewing the Stein operator as an operator acting on pairs of functions, we provide an extensive toolkit for distributional comparisons. Several abstract approximation theorems are provided. Our approach is illustrated for comparison of several pairs of distributions: normal vs normal, sums of independent Rademacher vs normal, normal vs Student, and maximum of random variables vs exponential, Fréchet and Gumbel.
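For orientation, the best-known instance of a differential-type Stein operator is the classical one for the standard normal distribution. It is shown here only as a familiar special case and is not the canonical definition proposed in the paper; the symbol \mathcal{A} is notation chosen for this illustration.

```latex
\[
  \mathcal{A}f(x) = f'(x) - x f(x),
  \qquad
  \mathbb{E}\bigl[\mathcal{A}f(Z)\bigr]
    = \mathbb{E}\bigl[f'(Z) - Z f(Z)\bigr] = 0
  \quad \text{for } Z \sim \mathcal{N}(0,1),
\]
```

The identity holds for every absolutely continuous f with E|f'(Z)| finite, and it characterizes the standard normal law: a random variable satisfying it for all such f must be N(0,1).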