Species distribution modeling (SDM) is widely used in ecology and conservation. Currently, the most widely available data for SDM are species presence-only records (available through digital databases). There have been many studies comparing the performance of alternative algorithms for modeling presence-only data. Among these, a 2006 paper from Elith and colleagues has been particularly influential in the field, partly because they used several methods that were novel at the time on a global data set that included independent presence-absence records for model evaluation. Since its publication, some of the algorithms have been further developed and new ones have emerged. In this paper, we explore patterns in predictive performance across methods, by reanalyzing the same data set (225 species from six different regions) using updated modeling knowledge and practices. We apply well-established methods such as generalized additive models and MaxEnt, alongside others that have received attention more recently, including regularized regressions, point-process weighted regressions, random forests, XGBoost, support vector machines, and the ensemble modeling framework biomod. All the methods we use include background samples (a sample of environments in the landscape) for model fitting. We explore the impacts of using weights on the presence and background points in model fitting. We introduce new ways of evaluating models fitted to these data, using the area under the precision-recall gain curve, and focusing on the rank of results. We find that the way models are fitted matters. The top method was an ensemble of tuned individual models. In contrast, ensembles built using the biomod framework with default parameters performed no better than single, moderately performing models. Similarly, the second-best method was a random forest parameterized to deal with many background samples (in contrast to relatively few presence records), which substantially outperformed other random forest implementations. We find that, in general, nonparametric techniques with the capability of controlling for model complexity outperformed traditional regression methods, with MaxEnt and boosted regression trees still among the top performing models. All the data and code with working examples are provided to make this study fully reproducible.
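To make the precision-recall gain evaluation mentioned above more concrete, here is a minimal sketch that computes precision gain and recall gain at a single threshold from confusion-matrix counts. It assumes the usual definition, in which precision and recall are rescaled against the proportion of positives (pi). The function name and the counts used with it are hypothetical, and this is not the authors' evaluation code. The area under the curve would be obtained by repeating this over many thresholds.

```python
def precision_recall_gain(tp, fp, fn, tn):
    """Return (precision_gain, recall_gain) for one classification threshold.

    Assumed definition: gain = (value - pi) / ((1 - pi) * value),
    where pi is the proportion of positive (presence) records.
    """
    n = tp + fp + fn + tn
    pi = (tp + fn) / n
    if tp == 0 or pi >= 1.0:
        raise ValueError("gains are undefined without true positives or negatives")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    precision_gain = (precision - pi) / ((1 - pi) * precision)
    recall_gain = (recall - pi) / ((1 - pi) * recall)
    return precision_gain, recall_gain
```

Under this definition, a model that predicts presence everywhere has a precision gain of zero, which is what makes the baseline explicit compared with a plain precision-recall curve.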
When applied to structured data, conventional random cross‐validation techniques can lead to underestimation of prediction error, and may result in inappropriate model selection.
We present the R package blockCV, a new toolbox for cross‐validation of species distribution modelling. Although it has been developed with species distribution modelling in mind, it can be used for any spatial modelling.
The package can generate spatially or environmentally separated folds. It includes tools to measure spatial autocorrelation ranges in candidate covariates, providing the user with insights into the spatial structure in these data. It also offers interactive graphical capabilities for creating spatial blocks and exploring data folds.
Package blockCV enables modellers to more easily implement a range of evaluation approaches. It will help the modelling community learn more about the impacts of evaluation approaches on our understanding of predictive performance of species distribution models.
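The idea of spatially separated folds can be sketched in a few lines. The example below is an illustration only and does not use or reproduce the blockCV API: it assumes planar coordinates, square blocks of a user-chosen size, and random assignment of whole blocks to folds. The function name and parameters are hypothetical.

```python
import math
import random


def spatial_block_folds(points, block_size, k, seed=0):
    """Assign each (x, y) point to one of k folds using square spatial blocks.

    All points falling in the same block share a fold, so training and
    testing data are separated in space rather than mixed at random.
    """
    # Identify the block each point falls in by its grid column and row.
    block_of = [
        (math.floor(x / block_size), math.floor(y / block_size))
        for x, y in points
    ]
    # Shuffle the occupied blocks, then deal them out to folds in turn.
    blocks = sorted(set(block_of))
    random.Random(seed).shuffle(blocks)
    fold_of_block = {block: i % k for i, block in enumerate(blocks)}
    return [fold_of_block[block] for block in block_of]
```

In practice the block size would be chosen with reference to the spatial autocorrelation range of the covariates, which is the kind of guidance the package is described as providing.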
The random forest (RF) algorithm is an ensemble of classification or regression trees and is widely used, including for species distribution modelling (SDM). Many researchers use implementations of RF in the R programming language with default parameters to analyse species presence‐only data together with 'background' samples. However, there is good evidence that RF with default parameters does not perform well for such 'presence‐background' modelling. This is often attributed to the disparity between the number of presence and background samples, also known as 'class imbalance', and several solutions have been proposed. Here, we first set the context: the background sample should be large enough to represent all environments in the region. We then aim to understand the drivers of poor performance of RF when models are fitted to presence‐only species data alongside background samples. We show that 'class overlap' (where both classes occur in the same environment) is an important driver of poor performance, alongside class imbalance. Class overlap can even degrade performance for presence–absence data. We explain, test and evaluate suggested solutions. Using simulated and real presence‐background data, we compare performance of default RF with other weighting and sampling approaches. Our results demonstrate clear evidence of improvement in the performance of RFs when techniques that explicitly manage imbalance are used. We show that these either limit or enforce tree depth. Without compromising the environmental representativeness of the sampled background, we identify approaches to fitting RF that ameliorate the effects of imbalance and overlap and allow excellent predictive performance. Understanding the problems of RF in presence‐background modelling allows new insights into how best to fit models, and should guide future efforts to best deal with such data.
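One sampling approach of the kind referred to above can be sketched as a balanced bootstrap, in which each tree is grown on equal numbers of presence and background records while the full background sample stays available to the forest as a whole. This is an assumed illustration, not necessarily the configuration evaluated in the paper. The function name is hypothetical, and labels are assumed to be 1 for presence and 0 for background.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree, with equal draws from each class.

    Both classes are sampled with replacement, and the sample size per
    class is set by the smaller class (usually the presences).
    """
    presence = [i for i, y in enumerate(labels) if y == 1]
    background = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(presence), len(background))
    if n == 0:
        raise ValueError("both classes must contain at least one record")
    return rng.choices(presence, k=n) + rng.choices(background, k=n)
```

Because a fresh sample is drawn for every tree, different trees see different parts of the background, so its environmental representativeness is not discarded even though each individual tree is fitted to balanced data.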
Aim: After environmental disasters, species with large population losses may need urgent protection to prevent extinction and support recovery. Following the 2019-2020 Australian megafires, we estimated population losses and recovery in fire-affected fauna, to inform conservation status assessments and management.
Location: Temperate and subtropical Australia.
Time period: 2019-2030 and beyond.
Major taxa: Australian terrestrial and freshwater vertebrates; one invertebrate group.
Methods: From > 1,050 fire-affected taxa, we selected 173 whose distributions substantially overlapped the fire extent. We estimated the proportion of each taxon's distribution affected by fires, using fire severity and aquatic impact mapping, and new distribution mapping. Using expert elicitation informed by evidence of responses to previous wildfires, we estimated local population responses to fires of varying severity. We combined the spatial and elicitation data to estimate overall population loss and recovery trajectories, and thus indicate potential eligibility for listing as threatened, or uplisting, under Australian legislation.
Results: We estimate that the 2019-2020 Australian megafires caused, or contributed to, population declines that make 70-82 taxa eligible for listing as threatened;
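The step in the Methods that combines spatial and elicitation data can be read, in its simplest form, as a weighted sum: the fraction of a taxon's distribution in each fire severity class multiplied by the elicited local decline for that class. The sketch below shows only this arithmetic. The severity class names and all numbers used with it are hypothetical, and the published analysis also handles elicitation uncertainty and recovery over time, which are not represented here.

```python
def overall_population_loss(area_fraction, local_decline):
    """Estimate the overall proportional loss for one taxon.

    area_fraction: severity class -> fraction of the distribution in it.
    local_decline: severity class -> elicited proportional local decline.
    """
    if set(area_fraction) != set(local_decline):
        raise ValueError("both inputs must cover the same severity classes")
    if sum(area_fraction.values()) > 1.0 + 1e-9:
        raise ValueError("area fractions cannot sum to more than 1")
    # Weight each class's local decline by its share of the distribution.
    return sum(area_fraction[c] * local_decline[c] for c in area_fraction)
```

For example, with half of a distribution unburnt, 20% burnt at low severity with an elicited 25% local decline, and 30% burnt at high severity with an 80% local decline, the estimated overall loss is 0.2 × 0.25 + 0.3 × 0.8 = 0.29.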
Species distribution models (SDMs) are widely used to predict and study distributions of species. Many different modeling methods and associated algorithms are used and continue to emerge. It is important to understand how different approaches perform, particularly when applied to species occurrence records that were not gathered in structured surveys (e.g. opportunistic records). This need motivated a large-scale, collaborative effort, published in 2006, that aimed to create objective comparisons of algorithm performance. As a benchmark, and to facilitate future comparisons of approaches, here we publish that dataset: point location records for 226 anonymized species from six regions of the world, with accompanying predictor variables in raster (grid) and point formats. A particularly interesting characteristic of this dataset is that independent presence-absence survey data are available for evaluation alongside the presence-only species occurrence data intended for modeling. The dataset is available on Open Science Framework and as an R package and can be used as a benchmark for modeling approaches and for testing new ways to evaluate the accuracy of SDMs.