Soil provides a key interface between the atmosphere and the lithosphere and plays an important role in food production, ecosystem services, and biodiversity. Demand for machine learning (ML) methods that improve the understanding of soil behavior has recently increased. Because real-world datasets are inherently imbalanced, ML models tend to overestimate the majority classes and underestimate the minority ones. The aim of this study was to investigate the effects of imbalance in training data on the performance of a random forest (RF) model. The original (imbalanced) dataset included 6100 soil texture observations from the surface layer of agricultural fields in northern Iran. The synthetic minority oversampling technique (SMOTE) was used to create a balanced dataset from the original data. Bioclimatic and remotely sensed data, distance, and terrain attributes were used as environmental covariates to model and map soil textural classes. Based on mean minimal depth (MMD), distance and annual mean precipitation were the important covariates when imbalanced data were used, whereas terrain attributes and remotely sensed data played a key role in predicting soil texture when balanced data were used. Balancing the data also improved the overall accuracy from 44% to 59% and the kappa value from 0.30 to 0.52. Similar increases were observed for recall and F-scores. It is concluded that, when modeling soil texture classes with RF models in a digital soil mapping approach, data should be balanced before modeling.
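The oversampling step named in the abstract above can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is not the authors' implementation. The function name `smote_oversample`, the parameters `n_new`, `k`, and `seed`, and the two-covariate example data are all hypothetical, and the sketch assumes numeric covariates and at least two minority samples.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority class described by two scaled covariates.
minority_class = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.22), (0.12, 0.30)]
new_points = smote_oversample(minority_class, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two existing samples, the new points stay inside the range already spanned by the minority class rather than duplicating records outright.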
Recently, the demand for high-quality land use/land cover (LULC) information for near-real-time crop type mapping, in particular for multi-relief landscapes, has increased. Because the LULC classes are inherently imbalanced, the statistics generally overestimate the majority classes and underestimate the minority ones. The aim of this study was therefore to assess the classes of the 10 m European Space Agency (ESA) WorldCover 2020 land use/land cover product with the support of Google Earth Engine (GEE) in the Honam sub-basin, western Iran, using the LACOVAL (validation tool for regional-scale land cover and land cover change) online platform. The effect of imbalanced ground truth was also explored. Four sampling schemes were applied to a total of 720 ground truth points collected over approximately 14,100 ha. Grassland and cropland together covered 94% of the study area, while barren land, shrubland, trees, and built-up areas covered the rest. The validation showed that the equalized sampling scheme gave more realistic results than the others, with roughly equal overall accuracy (91.6%), mean user’s accuracy (91.6%), mean producer’s accuracy (91.9%), mean partial portmanteau (91.9%), and kappa (0.9). The product was statistically improved to 93.5% ± 0.04 by the assembling approach and segmented with the help of supplementary datasets and visual interpretation. The findings confirmed that, in mapping LULC, the class data should be balanced before accuracy assessment. It is concluded that the product is a reliable dataset for environmental modeling at the regional scale but needs some modifications for the bare land and grassland classes in mountainous semi-arid regions of the globe.
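The accuracy measures reported in the abstract above (overall accuracy, user's accuracy, producer's accuracy, and kappa) all derive from a confusion matrix of mapped versus reference classes. The following is a minimal sketch of the standard definitions, not the LACOVAL implementation. The function name `accuracy_metrics` and the two-class example matrix are hypothetical, and the sketch assumes every class has at least one mapped and one reference sample.

```python
def accuracy_metrics(matrix):
    """Compute accuracy measures from a square confusion matrix where
    matrix[i][j] counts samples mapped as class i with reference class j."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]  # samples per mapped class
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    diagonal = [matrix[i][i] for i in range(n)]  # correctly mapped samples

    overall = sum(diagonal) / total
    # User's accuracy: share of each mapped class that is correct.
    users = [diagonal[i] / row_totals[i] for i in range(n)]
    # Producer's accuracy: share of each reference class that is detected.
    producers = [diagonal[j] / col_totals[j] for j in range(n)]
    # Kappa compares observed agreement with agreement expected by chance.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    kappa = (overall - expected) / (1 - expected)
    return overall, users, producers, kappa


# Hypothetical two-class matrix (rows: mapped class, columns: reference class).
example = [[45, 5],
           [10, 40]]
overall, users, producers, kappa = accuracy_metrics(example)
```

When one class dominates the reference sample, overall accuracy is driven mostly by that class, which is one reason per-class measures and kappa are reported alongside it.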