A Bayesian non-parametric methodology has been recently proposed to deal with the issue of prediction within species sampling problems. Such problems concern the evaluation, conditional on a sample of size "n", of the species variety featured by an additional sample of size "m". Genomic applications pose the additional challenge of having to deal with large values of both "n" and "m". In such a case the computation of the Bayesian non-parametric estimators is cumbersome and prevents their implementation. We focus on the two-parameter Poisson-Dirichlet model and provide completely explicit expressions for the corresponding estimators, which can be easily evaluated for any sizes of "n" and "m". We also study the asymptotic behaviour of the number of new species conditionally on the observed sample: such an asymptotic result, combined with a suitable simulation scheme, allows us to derive asymptotic highest posterior density intervals for the estimates of interest. Finally, we illustrate the implementation of the proposed methodology by the analysis of five expressed sequence tags data sets. Copyright (c) 2009 Royal Statistical Society.
Discrete random probability measures and the exchangeable random partitions they induce are key tools for addressing a variety of estimation and prediction problems in Bayesian inference. Here we focus on the family of Gibbs-type priors, a recent elegant generalization of the Dirichlet and the Pitman-Yor process priors. These random probability measures share properties that are appealing both from a theoretical and an applied point of view: (i) they admit an intuitive predictive characterization justifying their use in terms of a precise assumption on the learning mechanism; (ii) they stand out in terms of mathematical tractability; (iii) they include several interesting special cases besides the Dirichlet and the Pitman-Yor processes. The goal of our paper is to provide a systematic and unified treatment of Gibbs-type priors and highlight their implications for Bayesian nonparametric inference. We deal with their distributional properties, the resulting estimators, frequentist asymptotic validation and the construction of time-dependent versions. Applications, mainly concerning mixture models and species sampling, serve to convey the main ideas. The intuition inherent to this class of priors and the neat results they lead to make one wonder whether it actually represents the most natural generalization of the Dirichlet process.
This paper concerns the use of Markov chain Monte Carlo methods for posterior sampling in Bayesian nonparametric mixture models with normalized random measure priors. Making use of some recent posterior characterizations for the class of normalized random measures, we propose novel Markov chain Monte Carlo methods of both marginal type and conditional type. The proposed marginal samplers are generalizations of Neal's well-regarded Algorithm 8 for Dirichlet process mixture models, whereas the conditional sampler is a variation of those recently introduced in the literature. For both the marginal and conditional methods, we consider as a running example a mixture model with an underlying normalized generalized Gamma process prior, and describe comparative simulation results demonstrating the efficacies of the proposed methods.Comment: Published in at http://dx.doi.org/10.1214/13-STS422 the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org
Gibbs-type random probability measures and the exchangeable random partitions they induce represent an important framework both from a theoretical and applied point of view. In the present paper, motivated by species sampling problems, we investigate some properties concerning the conditional distribution of the number of blocks with a certain frequency generated by Gibbs-type random partitions. The general results are then specialized to three noteworthy examples yielding completely explicit expressions of their distributions, moments and asymptotic behaviors. Such expressions can be interpreted as Bayesian nonparametric estimators of the rare species variety and their performance is tested on some real genomic data.Comment: Published in at http://dx.doi.org/10.1214/12-AAP843 the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org
Summary. Species sampling problems have a long history in ecological and biological studies and a number of issues, including the evaluation of species richness, the design of sampling experiments, and the estimation of rare species variety, are to be addressed. Such inferential problems have recently emerged also in genomic applications, however, exhibiting some peculiar features that make them more challenging: specifically, one has to deal with very large populations (genomic libraries) containing a huge number of distinct species (genes) and only a small portion of the library has been sampled (sequenced). These aspects motivate the Bayesian nonparametric approach we undertake, since it allows to achieve the degree of flexibility typically needed in this framework. Based on an observed sample of size n, focus will be on prediction of a key aspect of the outcome from an additional sample of size m, namely, the so-called discovery probability. In particular, conditionally on an observed basic sample of size n, we derive a novel estimator of the probability of detecting, at the (n + m + 1)th observation, species that have been observed with any given frequency in the enlarged sample of size n + m. Such an estimator admits a closed-form expression that can be exactly evaluated. The result we obtain allows us to quantify both the rate at which rare species are detected and the achieved sample coverage of abundant species, as m increases. Natural applications are represented by the estimation of the probability of discovering rare genes within genomic libraries and the results are illustrated by means of two expressed sequence tags datasets.
In the present paper we define and investigate a novel class of distributions on the simplex, termed normalized infinitely divisible distributions, which includes the Dirichlet distribution. Distributional properties and general moment formulae are derived. Particular attention is devoted to special cases of normalized infinitely divisible distributions which lead to explicit expressions. As a by-product also new distributions over the unit interval and a generalization of the Bessel function distribution are obtained.
Human microbiome studies use sequencing technologies to measure the abundance of bacterial species or Operational Taxonomic Units (OTUs) in samples of biological material. Typically the data are organized in contingency tables with OTU counts across heterogeneous biological samples. In the microbial ecology community, ordination methods are frequently used to investigate latent factors or clusters that capture and describe variations of OTU counts across biological samples. It remains important to evaluate how uncertainty in estimates of each biological sample’s microbial distribution propagates to ordination analyses, including visualization of clusters and projections of biological samples on low dimensional spaces. We propose a Bayesian analysis for dependent distributions to endow frequently used ordinations with estimates of uncertainty. A Bayesian nonparametric prior for dependent normalized random measures is constructed, which is marginally equivalent to the normalized generalized Gamma process, a well-known prior for nonparametric analyses. In our prior, the dependence and similarity between microbial distributions is represented by latent factors that concentrate in a low dimensional space. We use a shrinkage prior to tune the dimensionality of the latent factors. The resulting posterior samples of model parameters can be used to evaluate uncertainty in analyses routinely applied in microbiome studies. Specifically, by combining them with multivariate data analysis techniques we can visualize credible regions in ecological ordination plots. The characteristics of the proposed model are illustrated through a simulation study and applications in two microbiome datasets.
Random probability measures are the main tool for Bayesian nonparametric inference, with their laws acting as prior distributions. Many well-known priors used in practice admit different, though (in distribution) equivalent, representations. Some of these are convenient if one wishes to thoroughly analyze the theoretical properties of the priors being used, others are more useful for modeling dependence and for addressing computational issues. As for the latter purpose, so-called stick-breaking constructions certainly stand out. In this paper we focus on the recently introduced normalized inverse Gaussian process and provide a completely explicit stick-breaking representation for it. Such a new result is of interest both from a theoretical viewpoint and for statistical practice.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.