[1] This paper investigates the actual extrapolation capacity of three hydrological models under differing climate conditions. We propose a general testing framework in which we perform a series of split-sample tests, testing all possible combinations of calibration-validation periods using a 10-year sliding window. This methodology, which we have called the generalized split-sample test (GSST), provides insights into the models' transposability over time under various climatic conditions. The three conceptual rainfall-runoff models yielded similar results over a set of 216 catchments in southeast Australia. First, we assessed the models' efficiency in validation using a criterion combining the root-mean-square error and bias. A relation was found between this efficiency and changes in mean rainfall (P), but not with changes in mean potential evapotranspiration (PE) or air temperature (T). Second, we focused on average runoff volumes and found that simulation biases are greatly affected by changes in P: calibration over a wetter (drier) climate than the validation climate leads to an overestimation (underestimation) of the mean simulated runoff. We observed different magnitudes of these model deficiencies depending on the catchment considered. The results indicate that the transfer of model parameters in time may introduce a significant level of error into simulations, implying increased uncertainty in the various practical applications of these models (flow simulation, forecasting, design, reservoir management, climate change impact assessments, etc.). Testing model robustness with respect to this issue should help better quantify these uncertainties.
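As a rough illustration of the protocol described above (not the authors' code: the `calibrate` and `simulate` callables, the choice of non-overlapping periods, and the error metrics are assumptions made for this sketch), a generalized split-sample test over 10-year sliding windows could be organised as follows:

```python
import numpy as np

def generalized_split_sample_test(years, calibrate, simulate, window=10, step=1):
    """Sketch of a GSST loop: build every `window`-year period with a sliding
    step, calibrate on each period, and evaluate on every other period.

    years     : sorted list of hydrological years available for a catchment
    calibrate : function(period) -> model parameter set (user-supplied wrapper)
    simulate  : function(params, period) -> (simulated, observed) runoff arrays
    """
    periods = [tuple(years[i:i + window])
               for i in range(0, len(years) - window + 1, step)]

    results = []
    for cal in periods:
        params = calibrate(cal)
        for val in periods:
            if set(cal) & set(val):
                continue  # one possible choice: keep validation independent of calibration
            sim, obs = simulate(params, val)
            rmse = float(np.sqrt(np.mean((sim - obs) ** 2)))
            bias = float(np.sum(sim - obs) / np.sum(obs))  # relative volume bias
            results.append({"calibration": cal, "validation": val,
                            "rmse": rmse, "bias": bias})
    return results
```

Each (calibration, validation) pair could then be stratified by the difference in mean P between the two periods to reproduce the kind of transferability analysis described above.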
Reliable and precise probabilistic prediction of daily catchment-scale streamflow requires statistical characterization of the residual errors of hydrological models. This study focuses on approaches for representing error heteroscedasticity with respect to simulated streamflow, i.e., the pattern of larger errors in higher streamflow predictions. We evaluate eight common residual error schemes, including standard and weighted least squares, the Box-Cox transformation (with fixed and calibrated power parameter λ) and the log-sinh transformation. Case studies include 17 perennial and 6 ephemeral catchments in Australia and the United States, and two lumped hydrological models. Performance is quantified using predictive reliability, precision, and volumetric bias metrics. We find that the choice of heteroscedastic error modeling approach significantly impacts predictive performance, though no single scheme simultaneously optimizes all performance metrics. The set of Pareto-optimal schemes, reflecting performance trade-offs, comprises the Box-Cox schemes with λ of 0.2 and 0.5, and the log scheme (λ = 0, perennial catchments only). These schemes significantly outperform even the average-performing remaining schemes (e.g., across ephemeral catchments, median precision tightens from 105% to 40% of observed streamflow, and median biases decrease from 25% to 4%). Theoretical interpretations of the empirical results highlight the importance of capturing the skew/kurtosis of raw residuals and of reproducing zero flows. Paradoxically, calibration of λ is often counterproductive: in perennial catchments, it tends to overfit low flows at the expense of abysmal precision in high flows. The log-sinh transformation is dominated by the simpler Pareto-optimal schemes listed above. Recommendations are provided for researchers and practitioners seeking robust residual error schemes for practical work.
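As a minimal, hedged sketch of one of the schemes discussed above (the Box-Cox transformation; the array names, the offset parameter and the example values are illustrative assumptions, not the study's data or code), residuals can be computed in transformed space, where the scheme treats them as homoscedastic:

```python
import numpy as np

def box_cox(q, lam, offset=0.0):
    """Box-Cox transform of streamflow; lam = 0 reduces to the log transform."""
    q = np.asarray(q, dtype=float) + offset
    if lam == 0.0:
        return np.log(q)
    return (q ** lam - 1.0) / lam

def transformed_residuals(q_obs, q_sim, lam=0.2, offset=0.0):
    """Residuals in Box-Cox space; under such a scheme they are modelled as
    (approximately) Gaussian with constant variance."""
    return box_cox(q_obs, lam, offset) - box_cox(q_sim, lam, offset)

# Illustrative usage with the Pareto-optimal value lam = 0.2; a small offset
# keeps the log case finite and is a common device for the zero flows of
# ephemeral catchments.
q_obs = np.array([0.0, 0.5, 2.0, 10.0])   # observed streamflow (placeholder)
q_sim = np.array([0.1, 0.4, 2.5, 8.0])    # simulated streamflow (placeholder)
eta = transformed_residuals(q_obs, q_sim, lam=0.2, offset=0.01)
```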
Testing hydrological models under changing conditions is essential to evaluate their ability to cope with changing catchments and their suitability for impact studies. With this perspective in mind, a workshop dedicated to this issue was held at the 2013 General Assembly of the International Association of Hydrological Sciences (IAHS) in Göteborg, Sweden, in July 2013, during which the results of a common testing experiment were presented. Prior to the workshop, the participants had been invited to test their own models on a common set of basins showing varying conditions, specifically set up for the workshop. All these basins experienced changes, either in physical characteristics (e.g. changes in land cover) or in climate conditions (e.g. gradual temperature increase). This article presents the motivations and organization of this experiment, namely the testing (calibration and evaluation) protocol and the common framework of statistical procedures and graphical tools used to assess model performance. The basin datasets are also briefly introduced (a detailed description is provided in the associated Supplementary material).
As all hydrological models are intrinsically limited hypotheses on the behaviour of catchments, models, which attempt to represent real-world behaviour, will always remain imperfect. To make progress on the long road towards improved models, we need demanding tests, i.e. true crash tests. Efficient testing requires large and varied data sets to develop and assess hydrological models, to ensure their generality, to diagnose their failures and, ultimately, to help improve them.
"All that glitters is not gold" is one of those universal truths that also applies to hydrology, and particularly to the issue of model calibration, where a glittering mathematical optimum is too often mistaken for a hydrological optimum. This commentary aims to underline the fact that calibration difficulties have not disappeared with the advent of the latest search algorithms. While it is true that progress on the numerical front has allowed us to all but eradicate miscalibration issues, we still too often underestimate the remaining hydrological task: screening mathematical optima in order to identify those parameter sets which will also work sufficiently well outside the calibration period.
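To make the screening idea concrete, here is a deliberately simple sketch (the `objective` callable, the tolerance and the candidate parameter sets are hypothetical; this is an illustration, not the commentary's procedure): every near-optimal parameter set found in calibration is re-evaluated on an independent period, and only those whose performance does not collapse are retained.

```python
def screen_optima(candidate_params, objective, cal_period, val_period, max_drop=0.1):
    """Keep parameter sets whose efficiency loss between the calibration period
    and an independent validation period stays within `max_drop`.

    candidate_params : iterable of parameter sets that all look near-optimal in calibration
    objective        : function(params, period) -> efficiency score (higher is better)
    """
    retained = []
    for params in candidate_params:
        cal_score = objective(params, cal_period)
        val_score = objective(params, val_period)
        if cal_score - val_score <= max_drop:  # performance holds up outside calibration
            retained.append((params, cal_score, val_score))
    return retained
```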