We argue that model selection uncertainty should be fully incorporated into statistical inference whenever estimation is sensitive to model choice and that choice is made with reference to the data. We consider different philosophies for achieving this goal and suggest strategies for data analysis. We illustrate our methods through three examples. The first is a Poisson regression of bird counts in which a choice is to be made between inclusion of one or both of two covariates. The second is a line transect data set for which different models yield substantially different estimates of abundance. The third is a simulated example in which truth is known.
Generalized Additive Models (GAMs) have been popularized by the work of Hastie and Tibshirani (1990) and the availability of user-friendly GAM software in S-PLUS. However, whilst it is flexible and efficient, the GAM framework based on backfitting with linear smoothers presents some difficulties when it comes to model selection and inference. On the other hand, the mathematically elegant work of Wahba (1990) and co-workers on Generalized Spline Smoothing (GSS) provides a rigorous framework for model selection (Gu and Wahba, 1991) and inference with GAMs constructed from smoothing splines. Unfortunately, these models are computationally very expensive, with operation counts that are of cubic order in the number of data. A "middle way" between these approaches is to construct GAMs using penalized regression splines (see e.g.
Proteolysis in close vicinity of tumor cells is a hallmark of cancer invasion and metastasis. We show here that mouse mammary tumor virus-polyoma middle T antigen (PyMT) transgenic mice deficient for the cysteine protease cathepsin B (CTSB) exhibited a significantly delayed onset and reduced growth rate of mammary cancers compared with wild-type PyMT mice. with PyMT;ctsb +/+ cells, was used to address the role of stroma-derived CTSB in lung metastasis formation. Notably, ctsb −/− mice showed reduced number and volume of lung colonies, and infiltrating macrophages showed a strongly up-regulated expression of CTSB within metastatic cell populations. These results indicate that both cancer cell-derived and stroma cell-derived (i.e., macrophages) CTSB plays an important role in tumor progression and metastasis.
We develop scalable methods for fitting penalized regression spline based generalized additive models with of the order of 10⁴ coefficients to up to 10⁸ data. Computational feasibility rests on: (i) a new iteration scheme for estimation of model coefficients and smoothing parameters, avoiding poorly scaling matrix operations; (ii) parallelization of the iteration's pivoted block Cholesky and basic matrix operations; (iii) the marginal discretization of model covariates to reduce memory footprint, with efficient scalable methods for computing required crossproducts directly from the discrete representation. Marginal discretization enables much finer discretization than joint discretization would permit. We were motivated by the need to model four decades worth of daily particulate data from the U.K. Black Smoke and Sulphur Dioxide Monitoring Network. Although reduced in size recently, over 2000 stations have at some time been part of the network, resulting in some 10 million measurements. Modeling at a daily scale is desirable for accurate trend estimation and mapping, and to provide daily exposure estimates for epidemiological cohort studies. Because of the dataset size, previous work has focused on modeling time or space averaged pollution levels, but this is unsatisfactory from a health perspective, since it is often acute exposure locally and on the time scale of days that is of most importance in driving adverse health outcomes. If computed by conventional means our black smoke model would require a half terabyte of storage just for the model matrix, whereas we are able to compute with it on a desktop workstation. The best previously available reduced memory footprint method would have required three orders of magnitude more computing time than our new method. Supplementary materials for this article are available online.
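The idea behind point (iii) can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a single covariate, a hypothetical three-column polynomial basis in place of a spline basis, and equal-width binning (the names `basis`, `discretize`, `xty_discrete` and `xty_full` are all illustrative). Once a covariate takes only m distinct values, the crossproduct X'y can be formed from m basis rows and one pass over the data, without ever storing the n-row model matrix.

```python
import random

def basis(x):
    """Hypothetical 3-column basis (1, x, x^2) standing in for a spline basis."""
    return [1.0, x, x * x]

def discretize(xs, m):
    """Map each covariate value to one of m bin midpoints; return (grid, index)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / m or 1.0  # guard against a constant covariate
    idx = [min(int((x - lo) / width), m - 1) for x in xs]
    grid = [lo + (j + 0.5) * width for j in range(m)]
    return grid, idx

def xty_discrete(grid, idx, y):
    """X'y from the m distinct basis rows plus one accumulation pass over y."""
    ybar = [0.0] * len(grid)
    for k, yi in zip(idx, y):
        ybar[k] += yi  # sum the responses that share a discretized covariate value
    rows = [basis(g) for g in grid]
    return [sum(row[c] * s for row, s in zip(rows, ybar)) for c in range(3)]

def xty_full(grid, idx, y):
    """Reference computation that materialises all n rows of the model matrix."""
    rows = [basis(grid[k]) for k in idx]
    return [sum(row[c] * yi for row, yi in zip(rows, y)) for c in range(3)]

# Simulated data (an assumption for illustration, not the black smoke data).
random.seed(1)
n, m = 10_000, 50
x = [random.random() for _ in range(n)]
y = [random.gauss(0.0, 1.0) for _ in range(n)]
grid, idx = discretize(x, m)
fast = xty_discrete(grid, idx, y)
slow = xty_full(grid, idx, y)
max_gap = max(abs(a - b) for a, b in zip(fast, slow))
print(max_gap)  # the two routes agree up to floating-point rounding
```

The storage needed by the discrete route is the m basis rows plus an integer index vector of length n, which is why discretizing each covariate separately can afford a much finer grid than discretizing all covariates jointly.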
Forest health monitoring schemes were set up across Europe in the 1980s in response to concern about air pollution related forest dieback (Waldsterben) and have continued since then. Recent threats to forest health are climatic extremes likely to be due to global climate change, increased ground ozone levels and nitrogen deposition. We model yearly data on tree crown defoliation, an indicator of tree health, from a monitoring survey carried out in Baden-Württemberg, Germany since 1983. On a changing irregular grid, defoliation and other site specific variables are recorded. In Baden-Württemberg the temporal trend of defoliation differs between areas because of site characteristics and pollution levels, making it necessary to allow for space-time interaction in the model. For this purpose we propose to use generalized additive mixed
In spatial regression models, collinearity between covariates and spatial effects can lead to significant bias in effect estimates. This problem, known as spatial confounding, is encountered when modeling forestry data to assess the effect of temperature on tree health. Reliable inference is difficult as results depend on whether or not spatial effects are included in the model. We propose a novel approach, spatial+, for dealing with spatial confounding when the covariate of interest is spatially dependent but not fully determined by spatial location. Using a thin plate spline model formulation we see that, in this case, the bias in covariate effect estimates is a direct result of spatial smoothing. Spatial+ reduces the sensitivity of the estimates to smoothing by replacing the covariates by their residuals after spatial dependence has been regressed away. Through asymptotic analysis we show that spatial+ avoids the bias problems of the spatial model. This is also demonstrated in a simulation study. Spatial+ is straightforward to implement using existing software and, as the response variable is the same as that of the spatial model, standard model selection criteria can be used for comparisons. A major advantage of the method is also that it extends to models with non‐Gaussian response distributions. Finally, while our results are derived in a thin plate spline setting, the spatial+ methodology transfers easily to other spatial model formulations.
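The residual-replacement step can be sketched in a toy setting. Everything here is an assumption made for illustration rather than the authors' setup: a one-dimensional location, a cubic polynomial basis standing in for a thin plate spline, a ridge penalty standing in for the smoothing penalty, an unpenalized first-stage fit, and a noise-free simulated response with a known effect of 1.5. All function names are hypothetical.

```python
import random

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def penalized_ls(cols, y, pen):
    """Minimize ||y - sum_j cols[j] * z[j]||^2 + sum_j pen[j] * z[j]^2."""
    p = len(cols)
    A = [[sum(a * b for a, b in zip(cols[i], cols[j])) + (pen[i] if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    rhs = [sum(a * b for a, b in zip(cols[i], y)) for i in range(p)]
    return solve(A, rhs)

def spatial_basis(s):
    """Stand-in for a thin plate spline basis: columns 1, s, s^2, s^3."""
    return [[si ** d for si in s] for d in range(4)]

def covariate_effect(x, y, B, lam):
    """Coefficient of x in y ~ x + spatial terms; ridge penalty lam on non-constant terms."""
    pen = [0.0, 0.0] + [lam] * (len(B) - 1)  # x and the intercept are unpenalized
    return penalized_ls([x] + B, y, pen)[0]

def spatial_plus_effect(x, y, B, lam):
    """Regress x on location first, then use its residuals as the covariate."""
    coef = penalized_ls(B, x, [0.0] * len(B))  # first stage, unpenalized here
    fx = [sum(col[i] * c for col, c in zip(B, coef)) for i in range(len(x))]
    resid = [xi - fi for xi, fi in zip(x, fx)]
    return covariate_effect(resid, y, B, lam)

# Simulated data: covariate and response share a smooth spatial pattern.
random.seed(2)
n, beta, lam = 200, 1.5, 10.0
s = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
x = [2.0 * si ** 2 + random.gauss(0.0, 0.3) for si in s]
y = [beta * xi + 3.0 * si ** 2 for xi, si in zip(x, s)]
B = spatial_basis(s)
beta_spatial = covariate_effect(x, y, B, lam)   # penalized spatial model
beta_plus = spatial_plus_effect(x, y, B, lam)   # same model with residualized covariate
print(beta_spatial, beta_plus)
```

In this toy setup the penalized spatial model pushes part of the shrunken spatial effect onto the covariate, whereas the residualized covariate is orthogonal to the spatial basis and its coefficient is unaffected by the penalty. With an unpenalized spatial term the two fits would coincide, which is consistent with the abstract's point that the bias arises from smoothing.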
Summary: 1. This study presents statistical methodology that uses spatial explanatory variables to improve simpler estimates of transition probabilities from categorical data, such as vegetation type, that have been recorded as classified cells (pixels) in a grid or lattice at different times. 2. A specific application is to examine successions in semi-natural vegetation in north-east Scotland. Questions related to these data include: Do transition probabilities of a pixel depend on the size of a patch of vegetation (polygon) and pixel location within the polygon? Do stable areas remain stable? Does the proximity of certain vegetation types influence transitions? 3. We selected spatial variables that were likely to be important in this application, where short-range vegetative spread was thought to be an important factor. 4. The multinomial logit model is used to estimate the transition probabilities as a function of explanatory variables, including location, neighbourhood information and other factors recorded at the start of the transition period. This model allowed the testing of different assumptions about the dynamics of underlying processes leading to transitions. 5. When the number of categories, for example vegetation types, observed is large in comparison to the sample size, estimates of transition probabilities can be unreliable. We show that using change of category within the time period as the response in a logistic regression can still provide insight to the underlying dynamics of change in such a case. 6. The methods are illustrated with some Scottish vegetation classification data with pixels of size 5 × 5 m covering a square of area 0·25 km². Two contrasting squares were investigated: the first was upland moorland grazed by sheep and the second was a lowland area with more varied vegetation and low intensity grazing by cattle. 7. In both squares there are strong spatial trends, and the neighbourhood of a pixel affected its transition. Prediction misclassification rates estimated from different models were compared using K-fold cross-validation. The multinomial model, including position in the square and number of neighbouring pixels in the same category as the pixel modelled, reduced the misclassification rate compared with the model without spatial explanatory variables. 8. The improved estimates of transition probabilities could be incorporated into Markov models used in simulation studies to predict future vegetation changes under different management strategies.
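Two ingredients named in this summary can be sketched briefly: the simple empirical transition probabilities that the multinomial model is meant to improve on, and the count of neighbouring pixels in the same category used as an explanatory variable. The 3 × 3 grids, the category labels "H" and "G", and the choice of an eight-pixel neighbourhood are all hypothetical; the multinomial logit fit itself is not shown.

```python
from collections import Counter, defaultdict

def transition_probabilities(grid_t0, grid_t1):
    """Empirical P(category at time 1 | category at time 0), pooled over pixels."""
    counts = defaultdict(Counter)
    for row0, row1 in zip(grid_t0, grid_t1):
        for before, after in zip(row0, row1):
            counts[before][after] += 1
    return {a: {b: k / sum(c.values()) for b, k in c.items()}
            for a, c in counts.items()}

def same_category_neighbours(grid, i, j):
    """Number of the (up to 8) surrounding pixels sharing the category of pixel (i, j)."""
    n_rows, n_cols = len(grid), len(grid[0])
    total = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            inside = 0 <= i + di < n_rows and 0 <= j + dj < n_cols
            if (di or dj) and inside:
                total += grid[i + di][j + dj] == grid[i][j]
    return total

# Hypothetical classified grids at two survey times ("H" and "G" are made-up labels).
t0 = ["HHG",
      "HGG",
      "GGG"]
t1 = ["HGG",
      "HGG",
      "GGG"]
probs = transition_probabilities(t0, t1)
edge_pixel = same_category_neighbours(t0, 0, 1)    # the "H" pixel that changes
centre_pixel = same_category_neighbours(t0, 1, 1)  # a "G" pixel that stays
print(probs, edge_pixel, centre_pixel)
```

A pooled table like `probs` ignores where a pixel sits; a covariate such as the neighbour count is what lets a regression model give different transition probabilities to pixels of the same category in different surroundings.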