may also lead to dependence between species (phylogenetic structure) or populations of species (genetic structure): those with more recent divergence will tend to be more similar than those that diverged longer ago (Harvey and Pagel 1991). While such underlying structures in the data are not fundamentally problematic for statistical analyses, they tend to create two undesirable outcomes. First, model error, as well as neglected processes and variables connected to these structures, often leads to dependence structures in the model residuals, which violates the critical assumption of independence present in many models and methods (Legendre and Fortin 1989, Miller et al. 2007). Second, because predictor variables are often correlated with underlying dependence structures (e.g. climate with space), models may use predictors to overfit the residual dependence structure and thereby remove it, partially or completely.
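A standard diagnostic for the spatial case of such residual dependence is Moran's I, which asks whether residuals at neighbouring sites are more similar than expected under independence. A minimal sketch in plain Python (the residuals and neighbour matrix are invented for illustration):

```python
def morans_i(values, weights):
    """Moran's I spatial autocorrelation statistic.

    values:  list of model residuals, one per site
    weights: n x n list-of-lists of spatial weights; weights[i][j] > 0
             if sites i and j are neighbours, 0 otherwise
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_sum = sum(sum(row) for row in weights)
    return (n / w_sum) * (num / den)

# toy example: four sites on a line with strongly clustered residuals
resid = [2.0, 1.5, -1.5, -2.0]
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(morans_i(resid, w))  # about 0.4: positive, i.e. clustered residuals
```

Values near zero indicate no spatial autocorrelation in the residuals; markedly positive values signal exactly the dependence structure described above.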
Species distribution models (SDMs) are used to inform a range of ecological, biogeographical and conservation applications. However, users often underestimate the strong links between data type, model output and suitability for end-use. We synthesize current knowledge and provide a simple framework that summarizes how interactions between data type and the sampling process (i.e. imperfect detection and sampling bias) determine the quantity that is estimated by a SDM. We then draw upon the published literature and simulations to illustrate and evaluate the information needs of the most common ecological, biogeographical and conservation applications of SDM outputs. We find that, while predictions of models fitted to the most commonly available observational data (presence records) suffice for some applications, others require estimates of occurrence probabilities, which are unattainable without reliable absence records. Our literature review and simulations reveal that, while converting continuous SDM outputs into categories of assumed presence or absence is common practice, it is seldom clearly justified by the application's objective and it usually degrades inference. Matching SDMs to the needs of particular applications is critical to avoid poor scientific inference and management outcomes. This paper aims to help modellers and users assess whether their intended SDM outputs are indeed fit for purpose.
Species distribution models (SDMs) constitute the most common class of models across ecology, evolution and conservation. The advent of ready‐to‐use software packages and increasing availability of digital geoinformation have considerably assisted the application of SDMs in the past decade, greatly enabling their broader use for informing conservation and management, and for quantifying impacts from global change. However, models must be fit for purpose, with all important aspects of their development and applications properly considered. Despite the widespread use of SDMs, standardisation and documentation of modelling protocols remain limited, which makes it hard to assess whether development steps are appropriate for end use. To address these issues, we propose a standard protocol for reporting SDMs, with an emphasis on describing how a study's objective is achieved through a series of modelling decisions. We call this the ODMAP (Overview, Data, Model, Assessment and Prediction) protocol, as its components reflect the main steps involved in building SDMs and other empirically‐based biodiversity models. The ODMAP protocol serves two main purposes. First, it provides a checklist for authors, detailing key steps for model building and analyses, and thus represents a quick guide and generic workflow for modern SDMs. Second, it introduces a structured format for documenting and communicating the models, ensuring transparency and reproducibility, facilitating peer review and expert evaluation of model quality, as well as meta‐analyses. We detail all elements of ODMAP, and explain how it can be used for different model objectives and applications, and how it complements efforts to store associated metadata and define modelling standards. We illustrate its utility by revisiting nine previously published case studies, and provide an interactive web‐based application to facilitate its use. We plan to advance ODMAP by encouraging its further refinement and adoption by the scientific community.
With the expansion in the quantity and types of biodiversity data being collected, there is a need to find ways to combine these different sources to provide cohesive summaries of species' potential and realized distributions in space and time. Recently, model-based data integration has emerged as a means to achieve this by combining datasets in ways that retain the strengths of each. We describe a flexible approach to data integration using point process models, which provide a convenient way to translate across ecological currencies. We highlight recent examples of large-scale ecological models based on data integration and outline the conceptual and technical challenges and opportunities that arise.
Species Distribution Models in Ecology
Large-scale ecological models of how species distributions and abundances vary over space and time are a critical tool in macroecology, biogeography, and conservation biology. They underpin our understanding of how biodiversity is shaped, how it is responding to anthropogenic activities, and how it might change in the future [1][2][3]. There is now a substantial literature on statistical tools for building species distribution models (SDMs) (see Glossary) and best practice in how to fit them [4][5][6][7]. SDMs also form a building block upon which more complex models, incorporating occupancy and/or abundance in space and time, can be built [8,9].
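One common way to fit such point process models is a Berman-Turner-style quadrature approximation: replace the integral in the Poisson process log-likelihood with a sum over grid cells, then maximise. The sketch below, with made-up presence locations on a one-dimensional landscape and a single covariate z(x) = x, fits a log-linear intensity exp(a + b*x) by Newton-Raphson; it illustrates the general mechanics only, not any specific model from the works cited:

```python
import math

# Quadrature fit of an inhomogeneous Poisson point process with
# intensity lambda(x) = exp(a + b*x) on [0, 1].
# Presence locations and the grid are invented for illustration.
points = [0.6, 0.7, 0.8, 0.85, 0.9, 0.95]     # observed presence locations
grid = [(i + 0.5) / 100 for i in range(100)]  # quadrature cell centres
dx = 0.01                                     # cell width

a, b = math.log(len(points)), 0.0             # start at intercept-only fit
for _ in range(50):                           # Newton-Raphson on the loglik
    mu = [math.exp(a + b * x) * dx for x in grid]  # expected count per cell
    g0 = len(points) - sum(mu)                               # dL/da
    g1 = sum(points) - sum(x * m for x, m in zip(grid, mu))  # dL/db
    h00 = sum(mu)                                            # -d2L/da2
    h01 = sum(x * m for x, m in zip(grid, mu))               # -d2L/dadb
    h11 = sum(x * x * m for x, m in zip(grid, mu))           # -d2L/db2
    det = h00 * h11 - h01 * h01
    a += (h11 * g0 - h01 * g1) / det          # solve the 2x2 Newton step
    b += (h00 * g1 - h01 * g0) / det

print(a, b)  # slope b comes out positive: points cluster at high x
```

Because different data types (presence-only points, counts, presence-absence) can all be expressed in terms of an underlying intensity surface, this likelihood is the "common currency" that integrated models build on.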
Aim: The idea of combining predictions from different models into an ensemble has gained considerable popularity in species distribution modelling, partly due to free and comprehensive software such as the R package BIOMOD. However, despite the proliferation of ensemble models, we lack oversight of how and where they are used for modelling distributions, and how well they perform. Here, we present such an overview. Location: Global. Methods: Since BIOMOD is freely available and widely used by ensemble species distribution modellers, we focused on articles that apply BIOMOD, filtering the initial 852 papers identified in our structured literature search to a relevant final subset of 224 eligible peer-reviewed journal articles. Results: BIOMOD-based ensembles are used across many taxa and locations, with terrestrial plants being the most represented group of species (n = 72) and Europe being the most represented continent (n = 106). These studies often focus on forecasting distributions in the future (n = 109), and commonly use presence-only species data (n = 139) and climatic environmental predictors (n = 219). An average of six models are used in ensembles, and approximately half of ensembles weight contributions of models by their cross-validation performance. However, discussion about choices made in the modelling process and unambiguous information on the performance of ensemble models versus individual models are limited. The use of independent data to validate model performance is particularly uncommon. Main conclusions: We document the breadth of ensemble applications, but could not draw strong quantitative conclusions about the predictive performance of ensemble models, due to lack of unambiguous information reported. Understanding how and where ensembles are best used when modelling species distributions is important for enabling best choices for different applications.
To enable this objective to be achieved, we provide recommendations for thorough reporting practices in a BIOMOD-based ensemble workflow. Keywords: BIOMOD, consensus forecast, ecological niche models, ensemble, habitat suitability models, species distribution model
Aim Species often remain undetected at sites where they are present. However, the impact of imperfect detection on species distribution models (SDMs) is not fully appreciated. In this paper we evaluate the influence of imperfect detection on the calibration and discrimination capacity of SDMs. We compare the performance of three types of SDMs: (1) a technique based on presence-absence data, (2) a technique based on presence-background data, and (3) a technique based on detection/non-detection data that accounts for imperfect detection. Innovation We use simulations to evaluate the impacts of imperfect detection in SDMs. This allows us to assess model performance with respect to the true objective of the models: the estimation of species distributions. We study a range of scenarios of occupancy and detection based on ecologically plausible environmental relationships and identify the circumstances in which imperfect detection affects model calibration and discrimination. We show that imperfect detection can substantially reduce the inferential and predictive accuracy of presence-absence and presence-background methods that do not account for detectability. While calibration is always affected, the influence on discrimination depends on the relationship between detectability and environmental variables. Main conclusions The performance of a model should be assessed with respect to its objectives. Comparative studies that intend to assess the performance of an SDM by evaluating its ability to predict detections rather than presences fail to reveal the benefits of accounting for detectability. Disregarding imperfect detection can have severe consequences for SDM performance, and hence for the estimation of species distributions. To date, this issue has been largely ignored in the SDM literature.
Simultaneously modelling occupancy and detection does not necessarily require a greater sampling effort, but rather that data are collected so that they are informative about detectability. We recommend that consideration of imperfect detection become standard practice for species distribution modelling.
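The core of such a model is a joint likelihood that separates occupancy probability (psi) from detection probability (p), using repeated visits to each site: a site with no detections may be unoccupied or occupied-but-missed. A stdlib-only sketch with invented detection histories (an illustrative scenario, not one of the paper's simulations):

```python
import math

# Detection histories: number of detections in J = 5 visits at 20 sites.
# Counts are invented for illustration.
J = 5
detections = [0] * 8 + [1] * 5 + [2] * 4 + [3] * 2 + [4]

def log_lik(psi, p):
    """Joint occupancy/detection log-likelihood (MacKenzie-style model)."""
    ll = 0.0
    for k in detections:
        if k > 0:   # site occupied and detected k times out of J visits
            ll += math.log(psi) + k * math.log(p) + (J - k) * math.log(1 - p)
        else:       # either occupied-but-missed, or truly absent
            ll += math.log(psi * (1 - p) ** J + (1 - psi))
    return ll

# crude grid-search MLE over (psi, p); adequate for a 2-parameter sketch
grid = [i / 100 for i in range(1, 100)]
psi_hat, p_hat = max(((s, q) for s in grid for q in grid),
                     key=lambda t: log_lik(*t))

naive = sum(k > 0 for k in detections) / len(detections)
print(psi_hat, p_hat, naive)  # psi_hat exceeds the naive occupancy rate
```

The naive estimate (fraction of sites with at least one detection) understates occupancy whenever p < 1; the joint model corrects for the missed sites, which is exactly why informative repeat-visit data matter.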
Species distribution modeling (SDM) is widely used in ecology and conservation. Currently, the most available data for SDM are species presence-only records (available through digital databases). There have been many studies comparing the performance of alternative algorithms for modeling presence-only data. Among these, a 2006 paper from Elith and colleagues has been particularly influential in the field, partly because they used several novel methods (at the time) on a global data set that included independent presence-absence records for model evaluation. Since its publication, some of the algorithms have been further developed and new ones have emerged. In this paper, we explore patterns in predictive performance across methods, by reanalyzing the same data set (225 species from six different regions) using updated modeling knowledge and practices. We apply well-established methods such as generalized additive models and MaxEnt, alongside others that have received attention more recently, including regularized regressions, point-process weighted regressions, random forests, XGBoost, support vector machines, and the ensemble modeling framework biomod. All the methods we use include background samples (a sample of environments in the landscape) for model fitting. We explore impacts of using weights on the presence and background points in model fitting. We introduce new ways of evaluating models fitted to these data, using the area under the precision-recall gain curve, and focusing on the rank of results. We find that the way models are fitted matters. The top method was an ensemble of tuned individual models. In contrast, ensembles built using the biomod framework with default parameters performed no better than single moderate performing models. 
Similarly, the second top performing method was a random forest parameterized to deal with many background samples (contrasted to relatively few presence records), which substantially outperformed other random forest implementations. We find that, in general, nonparametric techniques with the capability of controlling for model complexity outperformed traditional regression methods, with MaxEnt and boosted regression trees still among the top performing models. All the data and code with working examples are provided to make this study fully reproducible.
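For context, precision-recall-based evaluation compares thresholded predictions against independent presence-absence records. The paper uses the area under the precision-recall gain curve; the sketch below shows only the underlying precision/recall computation at a single threshold, on toy data:

```python
# Precision and recall for binarised SDM predictions at one threshold.
# Scores and labels are invented for illustration.
def precision_recall(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # predicted suitability
labels = [1,   1,   0,   1,   0,   0]     # independent presence-absence
print(precision_recall(scores, labels, 0.5))  # both equal 2/3 here
```

Sweeping the threshold over all score values and transforming to precision/recall *gain* yields the curve whose area the study uses for ranking methods.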
In ecology, the true causal structure for a given problem is often not known, and several plausible models and thus model predictions exist. It has been claimed that using weighted averages of these models can reduce prediction error, as well as better reflect model selection uncertainty. These claims, however, are often demonstrated by isolated examples. Analysts must better understand under which conditions model averaging can improve predictions and their uncertainty estimates. Moreover, a large range of different model averaging methods exists, raising the question of how they differ in their behaviour and performance. Here, we review the mathematical foundations of model averaging along with the diversity of approaches available. We explain that the error in model-averaged predictions depends on each model's predictive bias and variance, as well as the covariance in predictions between models, and uncertainty about model weights. We show that model averaging is particularly useful if the predictive error of contributing model predictions is dominated by variance, and if the covariance between models is low. For noisy data, which predominate in ecology, these conditions will often be met. Many different methods to derive averaging weights exist, ranging from Bayesian and information-theoretic approaches to cross-validation-optimised and resampling schemes. A general recommendation is difficult, because the performance of methods is often context dependent. Importantly, estimating weights creates some additional uncertainty. As a result, estimated model weights may not always outperform arbitrary fixed weights, such as equal weights for all models. When averaging a set of models with many inadequate models, however, estimating model weights will typically be superior to equal weights.
We also investigate the quality of the confidence intervals calculated for model‐averaged predictions, showing that they differ greatly in behaviour and seldom manage to achieve nominal coverage. Our overall recommendations stress the importance of non‐parametric methods such as cross‐validation for a reliable uncertainty quantification of model‐averaged predictions.
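The bias-variance-covariance argument can be made concrete for two unbiased predictions with variances v1, v2 and covariance c: the weighted average w*f1 + (1-w)*f2 has variance w^2*v1 + (1-w)^2*v2 + 2*w*(1-w)*c, so the gain from averaging shrinks as c grows. A small numeric sketch with toy values (not from the paper):

```python
# Variance of a weighted average of two unbiased predictors.
def avg_variance(w, v1, v2, c):
    return w * w * v1 + (1 - w) * (1 - w) * v2 + 2 * w * (1 - w) * c

# Weight minimising that variance (set the derivative in w to zero).
def opt_weight(v1, v2, c):
    return (v2 - c) / (v1 + v2 - 2 * c)

v1, v2 = 1.0, 1.0
for c in (0.0, 0.5, 0.9):
    w = opt_weight(v1, v2, c)
    print(c, w, avg_variance(w, v1, v2, c))
# equal variances give equal weights; the averaged variance is
# 0.5, 0.75 and 0.95 (vs 1.0 for a single model) as covariance grows
```

With equal variances the optimal weights are equal, which is one reason simple unweighted ensembles are often hard to beat; estimated weights only pay off when models differ substantially in adequacy.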