Motivated by a computer experiment for the design of a rocket booster, this paper explores nonstationary modeling methodologies that couple stationary Gaussian processes with treed partitioning. Partitioning is a simple but effective method for dealing with nonstationarity. The methodological developments and statistical computing details that make this approach efficient are described in detail. In addition to providing an analysis of the rocket booster simulator, our approach is demonstrated to be effective in other arenas.
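To make the partitioning idea concrete, here is a minimal, hypothetical sketch: split the input space into regions and fit an independent stationary GP in each, so different regions can carry different correlation structure. The single fixed split, simulator, and kernel choices below are illustrative stand-ins; the paper infers the tree and GP parameters with fully Bayesian methods.

```python
# Toy sketch of partitioned (treed-style) GP regression: one independent
# stationary GP per region of a hypothetical axis-aligned split at x = 0.5.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 1))
y = np.where(X[:, 0] < 0.5,
             np.sin(40 * X[:, 0]),   # rough regime on the left
             0.1 * X[:, 0])          # smooth regime on the right

left = X[:, 0] < 0.5                 # fixed split; a treed GP would learn this
fits = {name: GaussianProcessRegressor(RBF(0.1) + WhiteKernel(1e-4)).fit(X[m], y[m])
        for name, m in (("left", left), ("right", ~left))}

# route each new input to the GP for its region
x_new = np.array([[0.25], [0.75]])
preds = [fits["left" if x[0] < 0.5 else "right"].predict(x.reshape(1, -1))
         for x in x_new]
```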
We provide a new approach to approximate emulation of large computer experiments. By focusing expressly on desirable properties of the predictive equations, we derive a family of local sequential design schemes that dynamically define the support of a Gaussian process predictor based on a local subset of the data. We further derive expressions for fast sequential updating of all needed quantities as the local designs are built up iteratively. We then show how independent application of our local design strategy across the elements of a vast predictive grid facilitates a trivially parallel implementation. The end result is a global predictor able to take advantage of modern multicore architectures, providing a nonstationary modeling feature as a bonus. We demonstrate our method on two examples utilizing designs with thousands of data points, and compare to the method of compactly supported covariances.
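The gist of local prediction can be sketched as follows. This is a simplified illustration, not the paper's method: a plain nearest-neighbor subdesign and fixed hyperparameters stand in for the sequential design criteria and estimated parameters, and the loop over the predictive grid is where a parallel implementation would apply.

```python
# Minimal sketch of local approximate GP prediction: each prediction site
# gets its own small GP fit to nearby design points only, so sites can be
# processed independently (and hence in parallel).
import numpy as np

def gauss_kernel(A, B, lengthscale=0.5):
    # isotropic squared-exponential correlation between rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def local_gp_predict(x_star, X, y, n_local=50, nugget=1e-6):
    # 1. choose a local subdesign: the n_local nearest design points
    idx = np.argsort(((X - x_star) ** 2).sum(1))[:n_local]
    Xl, yl = X[idx], y[idx]
    # 2. ordinary zero-mean GP prediction on the subdesign only
    K = gauss_kernel(Xl, Xl) + nugget * np.eye(len(idx))
    k = gauss_kernel(Xl, x_star[None, :]).ravel()
    mean = k @ np.linalg.solve(K, yl)
    var = 1.0 + nugget - k @ np.linalg.solve(K, k)   # correlation-scale variance
    return mean, var

# independent application over a predictive grid (trivially parallelizable)
rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 2))
y = np.sin(5 * X[:, 0]) * np.cos(3 * X[:, 1])
grid = rng.uniform(size=(10, 2))
preds = [local_gp_predict(g, X, y) for g in grid]
```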
The Gaussian process is an indispensable tool for spatial data analysts. The onset of the “big data” era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low-rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations, and each was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics. Supplementary materials regarding implementation details of the methods and code are available for this article online at 10.1007/s13253-018-00348-w.
We present a unified view of likelihood-based Gaussian process regression for simulation experiments exhibiting input-dependent noise. Replication plays an important role in that context; however, previous methods leveraging replicates have either ignored the computational savings that come from such designs, or have short-cut full likelihood-based inference to remain tractable. Starting with homoskedastic processes, we show how multiple applications of a well-known Woodbury identity facilitate inference for all parameters under the likelihood (without approximation), bypassing the typical full-data-sized calculations. We then borrow a latent-variable idea from machine learning to address heteroskedasticity, adapting it to work within the same thrifty inferential framework, thereby simultaneously leveraging the computational and statistical efficiency of designs with replication. The result is an inferential scheme that can be characterized as a single objective function, complete with closed-form derivatives, for rapid library-based optimization. Illustrations are provided, including real-world simulation experiments from manufacturing and the management of epidemics.
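To illustrate the kind of savings the abstract refers to, here is a rough sketch, in simplified homoskedastic notation of our own (the symbols below are shorthand for this illustration, not necessarily the paper's), of how a Woodbury identity replaces the full-data solve with one of the size of the unique design:

```latex
% N runs at n << N unique inputs; U is the N-by-n matrix mapping unique
% inputs to their replicates, A_n = U^T U = diag(a_1, ..., a_n) holds the
% replicate counts, and K_n is the n-by-n kernel matrix on unique inputs.
\[
\left(\sigma^2 I_N + U K_n U^\top\right)^{-1}
  = \frac{1}{\sigma^2}\left[ I_N - U \left(\sigma^2 K_n^{-1} + A_n\right)^{-1} U^\top \right],
\]
% so the O(N^3) decomposition is replaced by an n-by-n one, and the data
% enter the resulting equations only through replicate averages.
```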
Significance: Forecasts routinely provide critical information for dangerous weather events, but not yet for epidemics. Researchers develop computational models that can be used for infectious disease forecasting, but forecasts have not been broadly compared or tested. We collaboratively compared forecasts from 16 teams for 8 years of dengue epidemics in Peru and Puerto Rico. The comparison highlighted components that forecasts captured well (e.g., situational awareness late in the season) and those that need more work (e.g., early-season forecasts). It also identified key facets to improve forecasts, including using multiple-model ensemble approaches to improve overall forecast skill. Future infectious disease forecasting work can build on these findings and this framework to improve the skill and utility of forecasts.
Computer experiments are often performed to allow modeling of a response surface of a physical experiment that can be too costly or difficult to run except via a simulator. Running the experiment over a dense grid can be prohibitively expensive, yet running over a sparse design chosen in advance can result in insufficient information in parts of the space, particularly when the surface calls for a nonstationary model. We propose an approach that automatically explores the space while simultaneously fitting the response surface, using predictive uncertainty to guide subsequent experimental runs. The newly developed Bayesian treed Gaussian process is used as the surrogate model, and a fully Bayesian approach allows explicit measures of uncertainty. We develop an adaptive sequential design framework to cope with an asynchronous, random, agent-based supercomputing environment, using a hybrid approach that melds optimal strategies from the statistics literature with flexible strategies from the active learning literature. The merits of this approach are borne out in several examples, including the motivating computational fluid dynamics simulation of a rocket booster.
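The basic loop of uncertainty-guided sequential design can be sketched as below. This is a toy illustration under simplifying assumptions: a plain GP stands in for the Bayesian treed GP surrogate, and a maximum-predictive-variance rule stands in for the hybrid design strategies described in the paper; the simulator and kernel are placeholders.

```python
# Toy sketch of adaptive sequential design: fit a surrogate, pick the
# candidate input with the largest predictive uncertainty, run the
# simulator there, and repeat.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def simulator(x):
    # placeholder for an expensive computer experiment
    return np.sin(10 * x) / (1 + x)

rng = np.random.default_rng(1)
X = rng.uniform(size=(5, 1))              # small initial design
y = simulator(X).ravel()
candidates = np.linspace(0, 1, 201).reshape(-1, 1)

for _ in range(20):
    gp = GaussianProcessRegressor(RBF(0.2) + WhiteKernel(1e-5)).fit(X, y)
    _, sd = gp.predict(candidates, return_std=True)
    x_new = candidates[[np.argmax(sd)]]   # most uncertain candidate
    X = np.vstack([X, x_new])
    y = np.append(y, simulator(x_new).ravel())
```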
Gaussian process (GP) regression models make for powerful predictors in out-of-sample exercises, but cubic runtimes for dense matrix decompositions severely limit the size of data (training and testing) on which they can be deployed. That means that in computer experiment, spatial/geophysical, and machine learning contexts, GPs no longer enjoy privileged status as data sets continue to balloon in size. We discuss an implementation of local approximate Gaussian process models, in the laGP package for R, that offers a particular sparse-matrix remedy uniquely positioned to leverage modern parallel computing architectures. The laGP approach can be seen as an update on the spatial statistical method of local kriging neighborhoods. We briefly review the method, and provide extensive illustrations of the features in the package through worked-code examples. The appendix covers custom building options for symmetric multi-processor and graphics processing units, and built-in wrapper routines that automate distribution over a simple network of workstations.
Most surrogate models for computer experiments are interpolators, and the most common interpolator is a Gaussian process (GP) that deliberately omits a small-scale (measurement) error term called the nugget. The explanation is that computer experiments are, by definition, "deterministic", and so there is no measurement error. We think this is too narrow a view of computer experiments and a statistically inefficient way to model them. We show that estimating a (non-zero) nugget can lead to surrogate models with better statistical properties, such as predictive accuracy and coverage, in a variety of common situations.
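A toy illustration of the contrast, under assumptions of our own choosing (the simulator, kernels, and data sizes are purely illustrative and not the paper's experiments): one GP is forced to (near-)interpolate with a tiny fixed jitter, while the other estimates a nugget via a white-noise kernel term fit by maximum likelihood.

```python
# Deterministic response, two surrogates: near-interpolation vs. an
# estimated nugget (WhiteKernel's noise level is fit during training).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(size=(30, 1)), axis=0)
y = np.sin(8 * X).ravel()                          # deterministic simulator output

interp = GaussianProcessRegressor(RBF(0.3), alpha=1e-10).fit(X, y)       # no nugget
with_nugget = GaussianProcessRegressor(RBF(0.3) + WhiteKernel(1e-4)).fit(X, y)

Xtest = np.linspace(0, 1, 200).reshape(-1, 1)
m1, s1 = interp.predict(Xtest, return_std=True)    # compare predictive accuracy
m2, s2 = with_nugget.predict(Xtest, return_std=True)  # and coverage of the two fits
```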