Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets, there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables is used or stepwise procedures are employed that iteratively remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Streams Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors and a reduced variable set model selected using a backward elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backward elimination procedure tended to select too few variables and exhibited numerous issues such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data. A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets.
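The idea of keeping validation folds external to the variable selection process can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used in the paper: a nearest-centroid classifier stands in for the random forest, a one-shot ranking by class-mean difference stands in for backward elimination, and the data are synthetic pure noise. All function names (`centroid_fit`, `select_features`, `cv_accuracy`) are hypothetical.

```python
import random

def centroid_fit(X, y, feats):
    """Return per-class mean vectors over the chosen feature indices."""
    cents = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    return cents

def centroid_predict(cents, x, feats):
    """Predict the class whose centroid is closest in squared distance."""
    def dist(c):
        return sum((x[j] - cj) ** 2 for j, cj in zip(feats, c))
    return min(cents, key=lambda label: dist(cents[label]))

def select_features(X, y, k):
    """Keep the k features with the largest class-mean difference.
    (A simple stand-in for importance-based elimination; two classes assumed.)"""
    p = len(X[0])
    cents = centroid_fit(X, y, range(p))
    a, b = sorted(cents)
    score = [abs(cents[a][j] - cents[b][j]) for j in range(p)]
    return sorted(range(p), key=lambda j: -score[j])[:k]

def cv_accuracy(X, y, k, n_folds, select_inside):
    """Cross-validated accuracy. With select_inside=True, selection is
    repeated on each training fold, so validation rows never influence it."""
    n = len(X)
    feats_all = None
    if not select_inside:
        feats_all = select_features(X, y, k)  # sees validation rows (leaky)
    correct = 0
    for f in range(n_folds):
        train = [i for i in range(n) if i % n_folds != f]
        valid = [i for i in range(n) if i % n_folds == f]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        feats = select_features(X_tr, y_tr, k) if select_inside else feats_all
        cents = centroid_fit(X_tr, y_tr, feats)
        correct += sum(centroid_predict(cents, X[i], feats) == y[i] for i in valid)
    return correct / n

# Synthetic pure-noise data: no feature is truly related to the labels.
rng = random.Random(0)
n, p = 60, 500
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

acc_external = cv_accuracy(X, y, k=10, n_folds=5, select_inside=True)
acc_leaky = cv_accuracy(X, y, k=10, n_folds=5, select_inside=False)
print("selection inside folds:", acc_external)
print("selection on all data: ", acc_leaky)
```

Because the labels here are unrelated to the features, an honest estimate should sit near chance, whereas selecting variables on all the data before cross-validating tends to give an optimistic figure. That gap is the kind of selection bias the abstract warns about.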
We propose various self-exciting point process models for the times when e-mails are sent between individuals in a social network. Using an EM-type approach, we fit these models to an e-mail network dataset from West Point Military Academy and the Enron e-mail dataset. We argue that the self-exciting models adequately capture major temporal clustering features in the data and perform better than traditional stationary Poisson models. We also investigate how accounting for diurnal and weekly trends in e-mail activity improves the overall fit to the observed network data. A motivation and application for fitting these self-exciting models is to use parameter estimates to characterize important e-mail communication behaviors such as the baseline sending rates, average reply rates, and average response times. A primary goal is to use these features, estimated from the self-exciting models, to infer the underlying leadership status of users in the West Point and Enron networks.
Understanding and mapping the spatial variation in stream biological condition could provide an important tool for conservation, assessment, and restoration of stream ecosystems. The USEPA's 2008-2009 National Rivers and Streams Assessment (NRSA) summarizes the percentage of stream lengths within the conterminous United States that are in good, fair, or poor biological condition based on a multimetric index of benthic invertebrate assemblages. However, condition is usually summarized at regional or national scales, and these assessments do not provide substantial insight into the spatial distribution of conditions at unsampled locations. We used random forests to model and predict the probable condition of several million kilometers of streams across the conterminous United States based on nearby and upstream landscape features, including human-related alterations to watersheds. To do so, we linked NRSA sample sites to the USEPA's StreamCat Dataset, a database of several hundred landscape metrics for all 1:100,000-scale streams and their associated watersheds within the conterminous United States. The StreamCat data provided geospatial indicators of nearby and upstream land use, land cover, climate, and other landscape features for modeling. Nationally, the model correctly predicted the biological condition class of 75% of NRSA sites. Although model evaluations suggested good discrimination among condition classes, we present maps as predicted probabilities of good condition, given upstream and nearby landscape settings. Inversely, the maps can be interpreted as the probability of a stream being in poor condition, given human-related watershed alterations. These predictions are available for download from the USEPA's StreamCat website. Finally, we illustrate how these predictions could be used to prioritize streams for conservation or restoration.
Environmental data may be "large" due to the number of records, the number of covariates, or both. Random forests has a reputation for good predictive performance when using many covariates with nonlinear relationships, whereas spatial regression, when using reduced rank methods, has a reputation for good predictive performance when using many records that are spatially autocorrelated. In this study, we compare these two techniques using a data set containing the macroinvertebrate multimetric index (MMI) at 1859 stream sites with over 200 landscape covariates. A primary application is mapping MMI predictions and prediction errors at 1.1 million perennial stream reaches across the conterminous United States. For the spatial regression model, we develop a novel transformation procedure that estimates Box-Cox transformations to linearize covariate relationships and handles possibly zero-inflated covariates. We find that the spatial regression model with transformations, and a subsequent selection of significant covariates, has cross-validation performance slightly better than random forests. We also find that prediction interval coverage is close to nominal for each method, but that spatial regression prediction intervals tend to be narrower and have less variability than quantile regression forest prediction intervals. A simulation study is used to generalize results and clarify advantages of each modeling approach.
Ecological and human health impairments related to excess nitrogen (N) in streams and rivers remain widespread in the United States (U.S.) despite recent efforts to reduce N pollution. Many studies have quantified the relationship between N loads to streams in terms of N mass and N inputs to watersheds; however, N concentrations, rather than loads, are more closely related to impacts on human health and aquatic life. Additionally, concentrations, rather than loads, trigger regulatory responses. In this study, we examined how N concentrations are related to N inputs to watersheds (atmospheric deposition, synthetic fertilizer, manure applied to agricultural land, cultivated biological N fixation, and point sources), land cover characteristics, and stream network characteristics, including stream size and the extent of lakes and reservoirs. N concentration data were collected across the conterminous U.S. during the U.S. Environmental Protection Agency's 2008-09 National Rivers and Streams Assessment (n = 1966). Median watershed N inputs were 15.7 kg N ha⁻¹ yr⁻¹. Atmospheric deposition accounted for over half the N inputs in 49% of watersheds, but watersheds with the highest N input rates were dominated by agriculture-related sources. Total N input to watersheds explained 42% and 38% of the variability in total N and dissolved inorganic N concentrations, respectively. Land cover characteristics were also important predictors, with wetland cover muting the effect of agricultural N inputs on N concentrations and riparian disturbance exacerbating it. In contrast, stream variables showed little correlation with N concentrations. This suggests that terrestrial factors that can be managed, such as agricultural N use practices and wetland or riparian areas, control the spatial variability in stream N concentrations across the conterminous U.S.
Social media data tend to cluster around events and themes. Local newsworthy events, sports team victories or defeats, abnormal weather patterns and globally trending topics all influence the content of online discussion. The automated discovery of these underlying themes from corpora of text is of interest to numerous academic fields as well as to law enforcement organizations and commercial users. One useful class of tools for such problems is topic models, which attempt to recover latent groups of word associations from the text. However, it is clear that these topics may also exhibit patterns in both time and space. The recovery of such patterns complements the analysis of the text itself and in many cases provides additional context. In this work we describe two methods for mining interesting spatio-temporal dynamics and relations among topics, one that compares the topic distributions as histograms in space and time and another that models topics over time as a temporal or spatio-temporal Hawkes process with exponential trigger functions. Both methods may be used to discover topics with abnormal distributions in space and time. The second method also allows for self-exciting topics and can recover intertopic relationships (excitation or inhibition) in both time and space. We apply these methods to a geo-tagged Twitter dataset and provide analysis and discussion of the results.
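For readers unfamiliar with the term, a temporal Hawkes process with an exponential trigger function has a conditional intensity of the form lambda(t) = mu + sum over past events t_i of alpha * beta * exp(-beta * (t - t_i)). The sketch below evaluates that textbook form for a single event stream. The parameter names (`mu`, `alpha`, `beta`) and this particular parameterization are assumptions for illustration; the abstract does not specify the authors' exact model, and the spatial and multi-topic extensions are not shown.

```python
import math

def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity of a univariate Hawkes process at time t.

    mu    : baseline (background) event rate
    alpha : expected number of events triggered by each past event
    beta  : decay rate of the exponential trigger function
    Only events strictly before t contribute.
    """
    excitation = sum(
        alpha * beta * math.exp(-beta * (t - t_i))
        for t_i in events
        if t_i < t
    )
    return mu + excitation

# Hypothetical event times for one topic.
event_times = [0.0, 0.4, 0.5]
for t in (0.6, 2.0, 10.0):
    print(t, hawkes_intensity(t, event_times, mu=0.5, alpha=0.8, beta=2.0))
```

Shortly after a burst of events the intensity sits well above the baseline `mu`, and it decays back toward `mu` as time passes. This is the self-exciting behavior the abstract refers to.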