In the context of exponentially growing molecular databases, it becomes increasingly easy to assemble large multigene data sets for phylogenomic studies. The expected increase in resolution due to the reduction of sampling (stochastic) error is becoming a reality. However, the impact of systematic biases will also become more apparent, or even dominant. We have chosen to study the case of the long-branch attraction (LBA) artefact using real instead of simulated sequences. Two fast-evolving eukaryotic lineages whose evolutionary positions are well established, microsporidia and the nucleomorph of cryptophytes, were chosen as model species. A large data set was assembled (44 species, 133 genes, and 24,294 amino acid positions), and the resulting rooted eukaryotic phylogeny (using a distant archaeal outgroup) is positively misled by an LBA artefact despite the use of a maximum likelihood-based tree reconstruction method with a complex model of sequence evolution. When the fastest-evolving proteins from the fast lineages are progressively removed (up to 90%), the bootstrap support for the apparently artefactual basal placement decreases to virtually 0%; conversely, among all possible locations of the fast-evolving species, only the expected placement receives increasing support, eventually converging to 100%. The percentage of removal of the fastest-evolving proteins constitutes a reliable estimate of the sensitivity of phylogenetic inference to LBA. This protocol confirms that both a rich species sampling (especially the presence of a species closely related to the fast-evolving lineage) and a probabilistic method with a complex model are important to overcome the LBA artefact. Finally, we observed that phylogenetic inference methods perform strikingly better on simulated than on real data, and suggest that testing the reliability of phylogenetic inference methods with simulated data leads to overconfidence in their performance.
Although phylogenomic studies can be affected by systematic biases, discarding the large fraction of the data that contains most of the nonphylogenetic signal makes it possible to recover a phylogeny that is less affected by these biases while maintaining high statistical support.
In this paper the filtering of partially observed diffusions, with discrete-time observations, is considered. It is assumed that only biased approximations of the diffusion can be obtained, for a choice of an accuracy parameter indexed by l. A multilevel estimator is proposed, consisting of a telescopic sum of increment estimators associated to the successive levels. The work required to achieve an O(ε^2) mean-square error between the multilevel estimator and the average with respect to the filtering distribution is shown to scale optimally, for example as O(ε^{-2}) for optimal rates of convergence of the underlying diffusion approximation. The method is illustrated on some toy examples as well as on the estimation of an interest rate based on real S&P 500 stock price data.
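The telescopic sum of increment estimators described above can be sketched for a plain (unfiltered) multilevel Monte Carlo estimate of E[X_T] under an Euler–Maruyama discretisation of geometric Brownian motion. The model, its parameters, and the level schedule are illustrative assumptions, not taken from the paper; the coarse/fine coupling via shared Brownian increments is the standard MLMC device:

```python
import numpy as np

def euler_gbm(h, T=1.0, x0=1.0, mu=0.05, sigma=0.2, rng=None):
    """One Euler-Maruyama path of dX = mu*X dt + sigma*X dW with step h."""
    x = x0
    for _ in range(int(T / h)):
        x += mu * x * h + sigma * x * rng.normal(0.0, np.sqrt(h))
    return x

def coupled_pair(l, rng, h0=0.5):
    """Fine (level l) and coarse (level l-1) endpoints driven by the SAME
    Brownian increments, so the increment estimator has small variance."""
    hf, hc = h0 * 2.0 ** -l, h0 * 2.0 ** -(l - 1)
    nf = int(1.0 / hf)
    dW = rng.normal(0.0, np.sqrt(hf), size=nf)
    xf = xc = 1.0
    for i in range(nf):
        xf += 0.05 * xf * hf + 0.2 * xf * dW[i]
    for i in range(nf // 2):
        xc += 0.05 * xc * hc + 0.2 * xc * (dW[2 * i] + dW[2 * i + 1])
    return xf, xc

def mlmc(L, N, rng):
    """Telescoping sum: E[P_L] = E[P_0] + sum_{l=1}^{L} E[P_l - P_{l-1}]."""
    est = np.mean([euler_gbm(0.5, rng=rng) for _ in range(N)])
    for l in range(1, L + 1):
        pairs = [coupled_pair(l, rng) for _ in range(N)]
        est += np.mean([f - c for f, c in pairs])
    return est

rng = np.random.default_rng(0)
print(mlmc(L=4, N=2000, rng=rng))  # close to E[X_1] = exp(0.05) ~ 1.0513
```

In practice the sample sizes would decrease with the level (the increments have vanishing variance), which is where the O(ε^{-2}) cost claim comes from; a constant N per level keeps the sketch short.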
In this article we consider the approximation of expectations w.r.t. probability distributions associated to the solution of partial differential equations (PDEs); this scenario appears routinely in Bayesian inverse problems. In practice, one often has to solve the associated PDE numerically, using, for instance, finite element methods, leading to a discretisation bias with step-size h_L. In addition, the expectation cannot be computed analytically and one often resorts to Monte Carlo methods. In the context of this problem, it is known that the introduction of the multilevel Monte Carlo (MLMC) method can reduce the amount of computational effort needed to estimate expectations, for a given level of error. This is achieved via a telescoping identity associated to a Monte Carlo approximation of a sequence of probability distributions with discretisation levels ∞ > h_0 > h_1 > · · · > h_L. In many practical problems of interest, one cannot sample i.i.d. from the associated sequence of probability distributions. A sequential Monte Carlo (SMC) version of the MLMC method is introduced to deal with this problem. It is shown that, under appropriate assumptions, the attractive property of a reduction of the computational effort needed to estimate expectations, for a given level of error, can be maintained within the SMC context. The approach is numerically illustrated on a Bayesian inverse problem.
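The telescoping identity behind MLMC, written for a generic quantity of interest g and the sequence of discretised distributions with levels h_0 > h_1 > · · · > h_L mentioned above, is simply:

```latex
\mathbb{E}_{h_L}[g] \;=\; \mathbb{E}_{h_0}[g] \;+\; \sum_{l=1}^{L} \Bigl( \mathbb{E}_{h_l}[g] - \mathbb{E}_{h_{l-1}}[g] \Bigr)
```

Each increment on the right is estimated separately (by coupled i.i.d. sampling in standard MLMC, by SMC here); because the increments shrink as h_l decreases, fewer samples are needed at the expensive fine levels, which is the source of the cost reduction.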
Model comparison for the purposes of selection, averaging, and validation is a problem found throughout statistics. Within the Bayesian paradigm, these problems all require the calculation of the posterior probabilities of models within a particular class. Substantial progress has been made in recent years, but difficulties remain in the implementation of existing schemes. This article presents adaptive sequential Monte Carlo (SMC) sampling strategies to characterize the posterior distribution of a collection of models, as well as the parameters of those models. Both a simple product estimator and a combination of SMC and a path sampling estimator are considered, and existing theoretical results are extended to include the path sampling variant. A novel approach to the automatic specification of distributions within SMC algorithms is presented and shown to outperform the state of the art in this area. The performance of the proposed strategies is demonstrated via an extensive empirical study. Comparisons with state-of-the-art algorithms show that the proposed algorithms are always competitive, and often substantially superior to alternative techniques, at equal computational cost and considerably less application-specific implementation effort. Supplementary materials for this article are available online.
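An adaptive SMC sampler of the general kind discussed above can be sketched on a one-dimensional toy target. The tempering path from prior to posterior, the ESS-based rule for choosing the next distribution, and all tuning constants below are assumptions made for the sketch, not the authors' specification:

```python
import numpy as np

def log_prior(x):  # N(0, 3^2) prior (unnormalised)
    return -0.5 * x ** 2 / 9.0

def log_lik(x):    # toy likelihood N(2, 0.5^2) (unnormalised)
    return -0.5 * (x - 2.0) ** 2 / 0.25

def ess(logw):
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

def next_beta(x, beta, N):
    """Bisect for the largest temperature step keeping ESS >= N/2
    (one common automatic specification of the next distribution)."""
    lo, hi = beta, 1.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if ess((mid - beta) * log_lik(x)) >= N / 2:
            lo = mid
        else:
            hi = mid
    return lo if 1.0 - lo > 1e-6 else 1.0

def smc_sampler(N=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 3.0, N)              # particles from the prior
    beta = 0.0
    while beta < 1.0:
        new = next_beta(x, beta, N)
        logw = (new - beta) * log_lik(x)      # incremental importance weights
        w = np.exp(logw - logw.max()); w /= w.sum()
        x = x[rng.choice(N, N, p=w)]          # multinomial resampling
        # one random-walk MH move targeting prior(x) * lik(x)^new
        prop = x + rng.normal(0.0, 0.5, N)
        logr = (log_prior(prop) + new * log_lik(prop)
                - log_prior(x) - new * log_lik(x))
        x = np.where(np.log(rng.uniform(size=N)) < logr, prop, x)
        beta = new
    return x

x = smc_sampler()
print(x.mean())  # conjugate posterior mean is 8/4.111... ~ 1.946
```

Products of the normalising constants of the incremental weights along the temperature path give the marginal likelihood estimate used for model comparison; the sketch omits that bookkeeping.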
Background: Probabilistic methods have progressively supplanted the Maximum Parsimony (MP) method for inferring phylogenetic trees. One of the major reasons for this shift was that MP is much more sensitive to the Long Branch Attraction (LBA) artefact than is Maximum Likelihood (ML). However, recent work by Kolaczkowski and Thornton suggested, on the basis of simulations, that MP is less sensitive than ML to tree reconstruction artefacts generated by heterotachy, a phenomenon that corresponds to shifts in site-specific evolutionary rates over time. These results led these authors to recommend that the results of ML and MP analyses should both be reported and interpreted with the same caution. This specific conclusion revived the debate on the choice of the most accurate phylogenetic method for analysing real data, in which various types of heterogeneities occur. However, variation of evolutionary rates across species was not explicitly incorporated in the original study of Kolaczkowski and Thornton, nor in most of the subsequent heterotachous simulations published to date, where all terminal branch lengths were kept equal, an assumption that is biologically unrealistic.
The flow of fresh groundwater to the ocean through the coast (fresh submarine groundwater discharge, or fresh SGD) plays an important role in global biogeochemical cycles and coastal water quality. In addition to delivering dissolved elements from land to sea, fresh SGD forms a natural barrier against salinization of coastal aquifers. Here we estimate groundwater discharge rates through the near-global coast (60°N to 60°S) at high resolution using a water budget approach. We find that tropical coasts export more than 56% of all fresh SGD, while midlatitude arid regions export only 10%. Fresh SGD rates from tectonically active margins (coastlines along tectonic plate boundaries) are also significantly greater than those from passive margins, where most field studies have been focused. Active margins combine rapid uplift and weathering with high rates of fresh SGD and may therefore host exceptionally large groundwater-borne solute fluxes to the coast.
Beskos et al. We consider the numerical approximation of the filtering problem in high dimensions, that is, when the hidden state lies in R^d with d large. For low-dimensional problems, one of the most popular numerical procedures for consistent inference is the class of approximations termed particle filters or sequential Monte Carlo methods. However, in high dimensions, standard particle filters (e.g. the bootstrap particle filter) can have a cost that is exponential in d for the algorithm to be stable in an appropriate sense. We develop a new particle filter, called the space-time particle filter, for a specific family of state-space models in discrete time. This new class of particle filters provides consistent Monte Carlo estimates for any fixed d, as do standard particle filters. Moreover, when there is a spatial mixing element in the dimension of the state vector, the space-time particle filter will scale much better with d than the standard filter for a class of filtering problems. We illustrate this analytically for a model of a simple i.i.d. structure and a model with mixing in the space direction, where we show that the algorithm exhibits certain stability properties as d increases, at a cost O(nNd^2), where n is the time parameter and N is the number of Monte Carlo samples, both fixed and independent of d. Our theoretical results are also supported by numerical simulations on practical models of complex structures. The results suggest that it is indeed possible to tackle some high-dimensional filtering problems using the space-time particle filter that standard particle filters cannot handle.
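For contrast with the space-time construction, the standard bootstrap particle filter the abstract refers to fits in a few lines. The linear-Gaussian state-space model and its parameters below are illustrative assumptions (chosen so the exact answer is available from a Kalman filter), not a model from the paper:

```python
import numpy as np

def bootstrap_pf(ys, N=5000, seed=0):
    """Bootstrap particle filter for x_t = 0.9 x_{t-1} + v_t, y_t = x_t + e_t,
    v_t, e_t ~ N(0,1), x_0 ~ N(0,1); returns filtering means E[x_t | y_{1:t}]."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, N)                      # particles from the prior
    means = []
    for y in ys:
        x = 0.9 * x + rng.normal(0.0, 1.0, N)        # propagate (proposal = transition)
        logw = -0.5 * (y - x) ** 2                    # Gaussian observation density
        w = np.exp(logw - logw.max()); w /= w.sum()
        means.append(np.sum(w * x))                   # weighted filtering estimate
        x = x[rng.choice(N, N, p=w)]                  # multinomial resampling
    return np.array(means)

# simulate a trajectory from the same model and filter it
rng = np.random.default_rng(1)
xs, x = [], 0.0
for _ in range(50):
    x = 0.9 * x + rng.normal()
    xs.append(x)
ys = np.array(xs) + rng.normal(size=50)
print(bootstrap_pf(ys)[-1])
```

In this one-dimensional example the filter is accurate with modest N; the point of the abstract is precisely that naive weighting of this kind degenerates as the state dimension d grows.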
In this article we consider computing expectations w.r.t. probability laws associated to a certain class of stochastic systems. In order to achieve such a task, one must not only resort to numerical approximation of the expectation, but also to a biased discretization of the associated probability. We are concerned with the situation for which the discretization is required in multiple dimensions, for instance in space-time. In such contexts, it is known that the multi-index Monte Carlo (MIMC) method of [7] can improve upon i.i.d. sampling from the most accurate approximation of the probability law. Through a non-trivial modification of the multilevel Monte Carlo (MLMC) method, this method can reduce the work needed to obtain a given level of error, relative to i.i.d. sampling and even relative to MLMC. In this article we consider the case when such probability laws are too complex to be sampled independently, for example a Bayesian inverse problem where evaluation of the likelihood requires the solution of a partial differential equation (PDE) model which needs to be approximated at finite resolution. We develop a modification of the MIMC method which allows one to use standard Markov chain Monte Carlo (MCMC) algorithms to replace independent and coupled sampling, in certain contexts. We prove a variance theorem for a simplified estimator which shows that using our MIMCMC method is preferable, in the sense above, to i.i.d. sampling from the most accurate approximation, under appropriate assumptions. The method is numerically illustrated on a Bayesian inverse problem associated to a stochastic partial differential equation (SPDE), where the path measure is conditioned on some observations.