2019
DOI: 10.1007/s00180-018-00861-z

Neural network gradient Hamiltonian Monte Carlo

Abstract: Hamiltonian Monte Carlo is a widely used algorithm for sampling from posterior distributions of complex Bayesian models. It can efficiently explore high-dimensional parameter spaces guided by simulated Hamiltonian flows. However, the algorithm requires repeated gradient calculations, and these computations become increasingly burdensome as data sets scale. We present a method to substantially reduce the computation burden by using a neural network to approximate the gradient. First, we prove that the proposed …
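A minimal sketch of the idea described in the abstract, assuming a standard Gaussian momentum and using illustrative names (leapfrog, hmc_step, grad_fn) rather than the authors' actual code: the leapfrog integrator only needs a gradient callable, so an exact full-data gradient can be swapped for a cheap neural-network surrogate without touching the rest of the sampler. The Metropolis correction below uses the exact log posterior, which is one common safeguard against surrogate error; the abstract does not confirm that this is exactly the authors' construction.

import numpy as np

def leapfrog(theta, p, grad_fn, step_size, n_steps):
    # Leapfrog integration of Hamiltonian dynamics; grad_fn returns the
    # gradient of the log posterior at theta (exact or neural-network surrogate).
    p = p + 0.5 * step_size * grad_fn(theta)
    for _ in range(n_steps - 1):
        theta = theta + step_size * p
        p = p + step_size * grad_fn(theta)
    theta = theta + step_size * p
    p = p + 0.5 * step_size * grad_fn(theta)
    return theta, p

def hmc_step(theta, log_post, grad_fn, step_size=0.01, n_steps=20, rng=None):
    # One HMC proposal plus accept/reject decision.
    rng = np.random.default_rng() if rng is None else rng
    p0 = rng.standard_normal(theta.shape)
    theta_new, p_new = leapfrog(theta.copy(), p0.copy(), grad_fn, step_size, n_steps)
    # Hamiltonians under a standard Gaussian kinetic energy.
    h_current = -log_post(theta) + 0.5 * p0 @ p0
    h_proposed = -log_post(theta_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < h_current - h_proposed:
        return theta_new
    return theta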

Cited by 16 publications (9 citation statements)
References 7 publications (12 reference statements)
“…A different kind of question is whether one might make GPU and multi-core SIMD speedups available for a broader class of Bayesian models. Li et al (2019) use neural networks to approximate an arbitrary model's log-posterior gradient and thus avoid expensive HMC gradient computations in a Big Data setting. On the other hand, GPUs greatly accelerate fitting and evaluation of deep neural networks (Bergstra et al, 2011).…”
Section: Discussion (mentioning)
confidence: 99%
“…The joint distribution for the GP W(s) is available in closed form but is cumbersome for large datasets; the joint distribution of the MSP R(s) is available only for a moderate number of spatial locations, and the joint distribution of the mixture model is more complicated than either of its components. An alternative is to build a surrogate likelihood for Bayesian computation (e.g., Rasmussen, 2003; Jabot et al, 2014; Wilkinson, 2014; Gutmann and Corander, 2016; Price et al, 2018; Drovandi et al, 2018; Wang and Li, 2018; Acerbi, 2018; Järvenpää et al, 2019, 2021; Li et al, 2019).…”
Section: Mixture Model (mentioning)
confidence: 99%
“…Rasmussen (2003) used GPs to jointly model the potential energy and its gradients. More recently, Li et al (2019) obtained better performance with a shallow neural network that is trained on gradient observations during early phases of the sampling procedure. With novel gradient inference routines we revisit the idea to replace ∇_x E by a GP gradient model that is trained on spatially diverse evaluations of the gradient during early phases of the sampling.…”
Section: Hamiltonian Monte Carlo (mentioning)
confidence: 99%
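The "train on early-phase gradients" pattern quoted above can be sketched as follows; the helper name fit_gradient_surrogate and the use of scikit-learn's MLPRegressor are assumptions for illustration, not the implementation from any of the cited papers. After a burn-in run with exact gradients, the recorded (position, gradient) pairs are fit by a shallow network whose prediction function then stands in for the exact gradient.

import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_gradient_surrogate(positions, gradients, hidden_units=50):
    # Fit a shallow (single hidden layer) network mapping sampled positions to
    # the log-posterior gradients observed at those positions during burn-in.
    net = MLPRegressor(hidden_layer_sizes=(hidden_units,), max_iter=5000)
    net.fit(np.asarray(positions), np.asarray(gradients))
    # Return a callable with the same signature as the exact gradient, so it can
    # replace grad_fn in an HMC step such as the one sketched after the abstract.
    return lambda theta: net.predict(np.atleast_2d(theta)).ravel()

For example, one would append theta and the exact gradient at theta to the positions and gradients buffers at each burn-in iteration, call fit_gradient_surrogate once enough pairs have accumulated, and continue sampling with the returned surrogate passed as grad_fn.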