We develop scalable methods for fitting penalized regression spline based generalized additive models with of the order of 10 4 coefficients to up to 10 8 data. Computational feasibility rests on: (i) a new iteration scheme for estimation of model coefficients and smoothing parameters, avoiding poorly scaling matrix operations; (ii) parallelization of the iteration's pivoted block Cholesky and basic matrix operations; (iii) the marginal discretization of model covariates to reduce memory footprint, with efficient scalable methods for computing required crossproducts directly from the discrete representation. Marginal discretization enables much finer discretization than joint discretization would permit. We were motivated by the need to model four decades worth of daily particulate data from the U.K. Black Smoke and Sulphur Dioxide Monitoring Network. Although reduced in size recently, over 2000 stations have at some time been part of the network, resulting in some 10 million measurements. Modeling at a daily scale is desirable for accurate trend estimation and mapping, and to provide daily exposure estimates for epidemiological cohort studies. Because of the dataset size, previous work has focused on modeling time or space averaged pollution levels, but this is unsatisfactory from a health perspective, since it is often acute exposure locally and on the time scale of days that is of most importance in driving adverse health outcomes. If computed by conventional means our black smoke model would require a half terabyte of storage just for the model matrix, whereas we are able to compute with it on a desktop workstation. The best previously available reduced memory footprint method would have required three orders of magnitude more computing time than our new method. Supplementary materials for this article are available online.
Wood et al. (J Am Stat Assoc 112(519):1199-1210, 2017) developed methods for fitting penalized regression spline based generalized additive models, with of the order of 10 4 coefficients, to up to 10 8 data. The methods offered two to three orders of magnitude reduction in computational cost relative to the most efficient previous methods. Part of the gain resulted from the development of a set of methods for efficiently computing model matrix products when model covariates each take only a discrete set of values substantially smaller than the sample size [generalizing an idea first appearing in Lang et al. (Stat Comput 24(2):223-238, 2014)]. Covariates can always be rounded to achieve such discretization, and it should be noted that the covariate discretization is marginal. That is we do not rely on discretizing covariates jointly, which would typically require the use of very coarse discretization. The most expensive computation in model estimation is the formation of the matrix cross product X T WX where X is a model matrix and W a diagonal or tri-diagonal matrix. The purpose of this paper is to present a simple, novel and substantially more efficient approach to the computation of this cross product. The new method offers, for example, a 30 fold reduction in cross product computation time for the Black Smoke model dataset motivating Wood et al. (2017). Given this reduction in computational cost, the subsequent Cholesky decomposition of X T WX and follow on computation of (X T WX) −1 become a more significant part of the computational burden, and we also discuss the choice of methods for improving their speed.
Background Longitudinal data analysis can improve our understanding of the influences on health trajectories across the life-course. There are a variety of statistical models which can be used, and their fitting and interpretation can be complex, particularly where there is a nonlinear trajectory. Our aim was to provide an accessible guide along with applied examples to using four sophisticated modelling procedures for describing nonlinear growth trajectories. Methods This expository paper provides an illustrative guide to summarising nonlinear growth trajectories for repeatedly measured continuous outcomes using (i) linear spline and (ii) natural cubic spline linear mixed-effects (LME) models, (iii) Super Imposition by Translation and Rotation (SITAR) nonlinear mixed effects models, and (iv) latent trajectory models. The underlying model for each approach, their similarities and differences, and their advantages and disadvantages are described. Their application and correct interpretation of their results is illustrated by analysing repeated bone mass measures to characterise bone growth patterns and their sex differences in three cohort studies from the UK, USA, and Canada comprising 8500 individuals and 37,000 measurements from ages 5–40 years. Recommendations for choosing a modelling approach are provided along with a discussion and signposting on further modelling extensions for analysing trajectory exposures and outcomes, and multiple cohorts. Results Linear and natural cubic spline LME models and SITAR provided similar summary of the mean bone growth trajectory and growth velocity, and the sex differences in growth patterns. Growth velocity (in grams/year) peaked during adolescence, and peaked earlier in females than males e.g., mean age at peak bone mineral content accrual from multicohort SITAR models was 12.2 years in females and 13.9 years in males. Latent trajectory models (with trajectory shapes estimated using a natural cubic spline) identified up to four subgroups of individuals with distinct trajectories throughout adolescence. Conclusions LME models with linear and natural cubic splines, SITAR, and latent trajectory models are useful for describing nonlinear growth trajectories, and these methods can be adapted for other complex traits. Choice of method depends on the research aims, complexity of the trajectory, and available data. Scripts and synthetic datasets are provided for readers to replicate trajectory modelling and visualisation using the R statistical computing software.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.