In risk evaluation, the effect of mixtures of environmental chemicals on a common adverse outcome is of interest. However, due to the high dimensionality and inherent correlations among chemicals that occur together, the traditional methods (e.g., ordinary or logistic regression) suffer from collinearity and variance inflation, and shrinkage methods have limitations in selecting among correlated components. We propose a weighted quantile sum (WQS) approach to estimating a body burden index, which identifies “bad actors” in a set of highly correlated environmental chemicals. We evaluate and characterize the accuracy of WQS regression in variable selection through extensive simulation studies, measuring sensitivity and specificity (i.e., the ability of the WQS method to select the bad actors correctly and not incorrect ones). We demonstrate the improvement in accuracy this method provides over traditional ordinary regression and shrinkage methods (lasso, adaptive lasso, and elastic net). Results from simulations demonstrate that WQS regression is accurate under some environmentally relevant conditions, but its accuracy decreases for a fixed correlation pattern as the association with a response variable diminishes. Nonzero weights (i.e., weights exceeding a selection threshold parameter) may be used to identify bad actors; however, components within a cluster of highly correlated active components tend to have lower weights, with the sum of their weights representative of the set.
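The index construction described above can be illustrated with a minimal Python sketch. It assumes quartile scoring and takes the weights as given; the function names (`quantile_scores`, `wqs_index`, `select_bad_actors`) and the threshold value are hypothetical, and the constrained estimation of the weights is not shown.

```python
import numpy as np

def quantile_scores(X, n_quantiles=4):
    """Score each column into 0..n_quantiles-1 using its empirical quantiles."""
    X = np.asarray(X, dtype=float)
    scores = np.empty(X.shape, dtype=int)
    probs = np.linspace(0, 1, n_quantiles + 1)[1:-1]  # interior cut points
    for j in range(X.shape[1]):
        cuts = np.quantile(X[:, j], probs)
        scores[:, j] = np.searchsorted(cuts, X[:, j], side="left")
    return scores

def wqs_index(scores, weights):
    """Weighted quantile sum: weights are nonnegative and sum to one."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    return scores @ weights

def select_bad_actors(weights, threshold):
    """Indices of components whose weight exceeds the selection threshold."""
    return np.flatnonzero(np.asarray(weights, dtype=float) > threshold)

# Toy data: two "chemicals" measured on eight subjects (illustrative only).
X = np.column_stack([np.arange(1, 9), np.arange(8, 0, -1)])
scores = quantile_scores(X)
weights = [0.75, 0.25]           # assumed, not estimated here
index = wqs_index(scores, weights)
selected = select_bad_actors(weights, threshold=0.5)
```

In a full analysis the index would enter a regression model for the outcome as a single covariate, which is why the weights of a correlated cluster are best read together rather than one at a time.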
Geographically weighted regression, Multicollinearity, Local regression diagnostics, Spatial eigenvectors, Experimental spatial design
Geographically weighted regression (GWR) is drawing attention as a statistical method to estimate regression models with spatially varying relationships between explanatory variables and a response variable. Local collinearity in weighted explanatory variables leads to GWR coefficient estimates that are correlated locally and across space, have inflated variances, and are at times counterintuitive and contradictory in sign to the global regression estimates. The presence of local collinearity in the absence of global collinearity necessitates the use of diagnostic tools in the local regression model building process to highlight areas in which the results are not reliable for statistical inference. The method of ridge regression can also be integrated into the GWR framework to constrain and stabilize regression coefficients and lower prediction error. This paper presents numerous diagnostic tools and ridge regression in GWR and demonstrates the utility of these techniques with an example using the Columbus crime dataset.
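As a rough illustration of how a ridge penalty and a local collinearity diagnostic fit into GWR, the sketch below solves a kernel-weighted ridge problem at each observation and reports the condition number of the weighted design matrix. The Gaussian kernel, the fixed bandwidth, the function names, and the choice to penalize every column (including any intercept) are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def gaussian_kernel(coords, i, bandwidth):
    """Gaussian distance-decay weights centered on observation i."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def gwr_ridge(X, y, coords, bandwidth, ridge=0.0):
    """Local coefficients and local condition numbers at every observation.

    ridge=0.0 gives ordinary GWR; ridge>0 shrinks the local coefficients.
    """
    n, p = X.shape
    betas = np.empty((n, p))
    cond = np.empty(n)
    penalty = ridge * np.eye(p)
    for i in range(n):
        w = gaussian_kernel(coords, i, bandwidth)
        Xw = X * w[:, None]                      # rows scaled by kernel weights
        betas[i] = np.linalg.solve(X.T @ Xw + penalty, Xw.T @ y)
        # Condition number of sqrt(W) X: large values flag local collinearity.
        cond[i] = np.linalg.cond(X * np.sqrt(w)[:, None])
    return betas, cond

# Synthetic example with a spatially constant relationship (illustrative only).
rng = np.random.default_rng(0)
coords = rng.uniform(size=(30, 2))
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0])
betas, cond = gwr_ridge(X, y, coords, bandwidth=0.5)
betas_ridge, _ = gwr_ridge(X, y, coords, bandwidth=0.5, ridge=5.0)
```

Locations with unusually large condition numbers are the ones where local estimates deserve the most caution. In practice the predictors would usually be standardized and the intercept left unpenalized.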
In evaluation of cancer risk related to environmental chemical exposures, the effect of many chemicals on disease is ultimately of interest. However, because of potentially strong correlations among chemicals that occur together, traditional regression methods suffer from collinearity effects, including regression coefficient sign reversal and variance inflation. In addition, penalized regression methods designed to remediate collinearity may have limitations in selecting the truly bad actors among many correlated components. The recently proposed method of weighted quantile sum (WQS) regression attempts to overcome these problems by estimating a body burden index, which identifies important chemicals in a mixture of correlated environmental chemicals. Our focus was on assessing through simulation studies the accuracy of WQS regression in detecting subsets of chemicals associated with health outcomes (binary and continuous) in site-specific analyses and in non-site-specific analyses. We also evaluated the performance of the penalized regression methods of lasso, adaptive lasso, and elastic net in correctly classifying chemicals as bad actors or unrelated to the outcome. We based the simulation study on data from the National Cancer Institute Surveillance Epidemiology and End Results Program (NCI-SEER) case–control study of non-Hodgkin lymphoma (NHL) to achieve realistic exposure situations. Our results showed that WQS regression had good sensitivity and specificity across a variety of conditions considered in this study. The shrinkage methods had a tendency to incorrectly identify a large number of components, especially in the case of strong association with the outcome.
Large variability and correlations among the coefficients obtained from the method of geographically weighted regression (GWR) have been identified in previous research. This is an issue that poses a serious challenge for the utility of the method as a tool to investigate multivariate relationships. The objectives of this paper are to assess the ability of GWR (1) to discriminate between a spatially constant process and one with spatially varying relationships, and (2) to accurately retrieve spatially varying relationships. Extensive numerical experiments are used to investigate situations where the underlying process is stationary and nonstationary, and to assess the degree to which spurious intercoefficient correlations are introduced. Two different implementations of GWR and cross-validation approaches are assessed. Results suggest that judicious application of GWR can be used to discern whether the underlying process is nonstationary. Furthermore, evidence of spurious correlations indicates that caution must be exercised when drawing conclusions regarding spatial relationships retrieved using this approach, particularly when working with small samples.
Intense demand for water in the Central Valley of California and related increases in groundwater nitrate concentration threaten the sustainability of the groundwater resource. To assess contamination risk in the region, we developed a hybrid, non-linear, machine learning model within a statistical learning framework to predict nitrate contamination of groundwater to depths of approximately 500 m below ground surface. A database of 145 predictor variables representing well characteristics, historical and current field and landscape-scale nitrogen mass balances, historical and current land use, oxidation/reduction conditions, groundwater flow, climate, soil characteristics, depth to groundwater, and groundwater age were assigned to over 6000 private supply and public supply wells measured previously for nitrate and located throughout the study area. The boosted regression tree (BRT) method was used to screen and rank variables to predict nitrate concentration at the depths of domestic and public well supplies. The novel approach included as predictor variables outputs from existing physically based models of the Central Valley. The top five most important predictor variables included two oxidation/reduction variables (probability of manganese concentration to exceed 50 ppb and probability of dissolved oxygen concentration to be below 0.5 ppm), field-scale adjusted unsaturated zone nitrogen input for the 1975 time period, average difference between precipitation and evapotranspiration during the years 1971-2000, and 1992 total landscape nitrogen input. Twenty-five variables were selected for the final model for log-transformed nitrate. In general, increasing probability of anoxic conditions and increasing precipitation relative to potential evapotranspiration had a corresponding decrease in nitrate concentration predictions. Conversely, increasing 1975 unsaturated zone nitrogen leaching flux and 1992 total landscape nitrogen input had an increasing relative impact on nitrate predictions. Three-dimensional visualization indicates that nitrate predictions depend on the probability of anoxic conditions and other factors, and that nitrate predictions generally decreased with increasing groundwater age.
In the field of spatial analysis, the interest of some researchers in modeling relationships between variables locally has led to the development of regression models with spatially varying coefficients. One such model that has been widely applied is geographically weighted regression (GWR). In the application of GWR, marginal inference on the spatial pattern of regression coefficients is often of interest, as is, less typically, prediction and estimation of the response variable. Empirical research and simulation studies have demonstrated that local correlation in explanatory variables can lead to estimated regression coefficients in GWR that are strongly correlated and, hence, problematic for inference on relationships between variables. The author introduces a penalized form of GWR, called the 'geographically weighted lasso' (GWL), which adds a constraint on the magnitude of the estimated regression coefficients to limit the effects of explanatory-variable correlation. The GWL also performs local model selection by potentially shrinking some of the estimated regression coefficients to zero in some locations of the study area. Two versions of the GWL are introduced: one designed to improve prediction of the response variable, and one more oriented toward constraining regression coefficients for inference. The results of applying the GWL to simulated and real datasets show that this method stabilizes regression coefficients in the presence of collinearity and produces lower prediction and estimation error of the response variable than do GWR and another constrained version of GWR, geographically weighted ridge regression.
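The idea of a locally weighted lasso can be sketched as follows. This is an illustration only: it swaps in plain coordinate descent with a single fixed penalty `lam`, a Gaussian kernel, and no intercept, all of which are assumptions made here rather than the fitting procedure or tuning strategy of the GWL paper. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def local_lasso(X, y, w, lam, n_iter=200):
    """Minimize 0.5 * sum_j w_j (y_j - x_j' b)^2 + lam * ||b||_1."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_norm = (w[:, None] * X ** 2).sum(axis=0)   # weighted column norms
    for _ in range(n_iter):
        for k in range(p):
            # Residual with predictor k's current contribution added back.
            partial = y - X @ beta + X[:, k] * beta[k]
            rho = np.sum(w * X[:, k] * partial)
            beta[k] = soft_threshold(rho, lam) / col_norm[k]
    return beta

def gwl(X, y, coords, bandwidth, lam):
    """Fit one kernel-weighted lasso per observation location."""
    betas = np.empty(X.shape)
    for i in range(len(y)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)
        betas[i] = local_lasso(X, y, w, lam)
    return betas

# Synthetic example (illustrative only): the second predictor is inactive.
rng = np.random.default_rng(1)
coords = rng.uniform(size=(25, 2))
X = rng.normal(size=(25, 3))
y = X @ np.array([2.0, 0.0, -1.0])
betas_unpenalized = gwl(X, y, coords, bandwidth=0.5, lam=0.0)
betas_heavy = gwl(X, y, coords, bandwidth=0.5, lam=1e6)
```

With `lam=0.0` the fit reduces to ordinary GWR, while a very large `lam` shrinks every local coefficient to zero; intermediate values give the location-specific selection the abstract describes.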
Background: Spatial cluster detection is an important tool in cancer surveillance to identify areas of elevated risk and to generate hypotheses about cancer etiology. There are many cluster detection methods used in spatial epidemiology to investigate suspicious groupings of cancer occurrences in regional count data and case-control data, where controls are sampled from the at-risk population. Numerous studies in the literature have focused on childhood leukemia because of its relatively large incidence among children compared with other malignant diseases and substantial public concern over elevated leukemia incidence. The main focus of this paper is an analysis of the spatial distribution of leukemia incidence among children from 0 to 14 years of age in Ohio from 1996-2003 using individual case data from the Ohio Cancer Incidence Surveillance System (OCISS).