Ryan Avery scite author profile

Remote sensing, or Earth Observation (EO), is increasingly used to understand Earth system dynamics and create continuous and categorical maps of biophysical properties and land cover, especially based on recent advances in machine learning (ML). ML models typically require large, spatially explicit training datasets to make accurate predictions. Training data (TD) are typically generated by digitizing polygons on high spatial-resolution imagery, by collecting in situ data, or by using pre-existing datasets. TD are often assumed to accurately represent the truth, but in practice almost always contain error, stemming from errors in (1) sample design and (2) sample collection. The latter is particularly relevant for image-interpreted TD, an increasingly common method due to its practicality and the increasing training sample size requirements of modern ML algorithms. TD errors can cause substantial errors in the maps created using ML algorithms, which may impact map use and interpretation. Despite these potential errors and their real-world consequences for map-based decisions, TD error is often not accounted for or reported in EO research. Here we review the current practices for collecting and handling TD. We identify the sources of TD error, illustrate their impacts using several case studies representing different EO applications (infrastructure mapping, global surface flux estimates, and agricultural monitoring), and provide guidelines for minimizing and accounting for TD errors. To harmonize terminology, we distinguish TD from three other classes of data that should be used to create and assess ML models: training reference data, used to assess the quality of TD during data generation; validation data, used to iteratively improve models; and map reference data, used only for final accuracy assessment.
We focus primarily on TD, but our advice is generally applicable to all four classes, and we ground our review in established best practices from the map accuracy assessment literature. EO researchers should start by determining the tolerable levels of map error and appropriate error metrics. Next, TD error should be minimized during sample design by choosing a representative spatio-temporal collection strategy, by using spatially and temporally relevant imagery and ancillary data sources during TD creation, and by selecting a set of legend definitions supported by the data. Furthermore, TD error can be minimized during the collection of individual samples by using consensus-based collection strategies, by directly comparing interpreted training observations against expert-generated training reference data to derive TD error metrics, and by providing image interpreters with thorough application-specific training. We strongly advise that TD error be incorporated in model outputs, either directly in bias and variance estimates or, at a minimum, by documenting the sources and implications of error. TD should be fully documented and made available via an open TD repository, allowing others to replicate and assess its use. To guide researchers in this process, we propose three tiers of TD error accounting standards. Finally, we advise researchers to clearly communicate the magnitude and impacts of TD error on map outputs, with specific consideration given to the likely map audience.

High Resolution, Annual Maps of Field Boundaries for Smallholder-Dominated Croplands at National Scales

Estes

Song

et al. 2022

Front. Artif. Intell.

Mapping the characteristics of Africa’s smallholder-dominated croplands, including the sizes and numbers of fields, can provide critical insights into food security and a range of other socioeconomic and environmental concerns. However, accurately mapping these systems is difficult because there is 1) a spatial and temporal mismatch between satellite sensors and smallholder fields, and 2) a lack of high-quality labels needed to train and assess machine learning classifiers. We developed an approach designed to address these two problems, and used it to map Ghana’s croplands. To overcome the spatio-temporal mismatch, we converted daily, high resolution imagery into two cloud-free composites (the primary growing season and subsequent dry season) covering the 2018 agricultural year, providing a seasonal contrast that helps to improve classification accuracy. To address the problem of label availability, we created a platform that rigorously assesses and minimizes label error, and used it to iteratively train a Random Forests classifier with active learning, which identifies the most informative training sample based on prediction uncertainty. Minimizing label errors improved model F1 scores by up to 25%. Active learning increased F1 scores by an average of 9.1% between first and last training iterations, and 2.3% more than models trained with randomly selected labels. We used the resulting 3.7 m map of cropland probabilities within a segmentation algorithm to delineate crop field boundaries. Using an independent map reference sample (n = 1,207), we found that the cropland probability and field boundary maps had respective overall accuracies of 88 and 86.7%, user’s accuracies for the cropland class of 61.2 and 78.9%, and producer’s accuracies of 67.3 and 58.2%. An unbiased area estimate calculated from the map reference sample indicates that cropland covers 17.1% (15.4–18.9%) of Ghana. 
Using the most accurate validation labels to correct for biases in the segmented field boundaries map, we estimated that the average size and total number of fields in Ghana are 1.73 ha and 1,662,281, respectively. Our results demonstrate an adaptable and transferable approach for developing annual, country-scale maps of crop field boundaries, with several features that effectively mitigate the errors inherent in remote sensing of smallholder-dominated agriculture.
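
For readers unfamiliar with the accuracy terms in this abstract, the following is a minimal sketch of how overall, user's, and producer's accuracy can be computed from a confusion matrix of raw sample counts. The function name and the example counts are hypothetical and are not taken from the paper; the paper's figures come from a map reference sample and may use area weighting, which this sketch omits.

```python
def accuracy_metrics(confusion, class_index):
    """Return (overall, user's, producer's) accuracy for one class.

    Assumed layout: confusion[i][j] counts samples mapped as class i
    whose reference label is class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    hit = confusion[class_index][class_index]
    # User's accuracy: share of samples mapped as the class that truly are it.
    mapped_as_class = sum(confusion[class_index])
    # Producer's accuracy: share of reference samples of the class that were mapped as it.
    reference_in_class = sum(row[class_index] for row in confusion)
    return correct / total, hit / mapped_as_class, hit / reference_in_class


# Hypothetical counts: class 0 = cropland, class 1 = non-cropland.
example = [
    [60, 40],   # mapped cropland: 60 correct, 40 actually non-cropland
    [30, 870],  # mapped non-cropland: 30 actually cropland, 870 correct
]
overall, users, producers = accuracy_metrics(example, class_index=0)
print(f"overall={overall:.3f} user's={users:.3f} producer's={producers:.3f}")
```

A high overall accuracy can coexist with much lower user's and producer's accuracies for a minority class such as cropland, which is the pattern the abstract reports.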

Variability in urban population distributions across Africa

Tuholske

Caylor

Avery

2019

Environ. Res. Lett.

Africa is projected to add one billion urban residents by 2050. Yet developing sustainable solutions to tackle the host of challenges posed by rapid urban population growth is stymied by a lack of municipality-level population data across the continent. To fill this gap, we intersect volunteered urban settlement data from OpenStreetMap with five synthetic gridded population datasets to estimate how Africa's urban population is distributed among over 4750 individual urban settlements across Africa. We assess how urban settlement distributions changed from 2000 to 2015 within and between countries and across moisture zones. To this end, we construct urban settlement Lorenz curves to calculate change in Gini coefficients and test the degree to which Africa's urban settlement distributions fit power law distributions exhibited by Zipf's law. Our results reveal that 77%-85% of urban settlements in Africa have fewer than 100 000 people and that at least 50% of Africa's urban population live in urban settlements with fewer than 1 million residents. Across almost all African countries, the distribution of urban population shifted towards larger cities between 2000 and 2015. However, in arid regions, our results indicate that small- and medium-sized urban settlements are absorbing a greater share of urban population growth compared to large urban settlements. While our urban population estimates vary across gridded population datasets and differ from United Nations estimates, this is the first paper to measure urban population across Africa using a consistent methodology to identify urban settlement populations. Unlike UN urban population data for Africa, our results can readily be incorporated with geolocated environmental, public health, and economic data to support efforts to monitor United Nations Sustainable Development Goals related to urban sustainability, poverty reduction, and food security across Africa's ever-growing urban settlements.
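
The Lorenz curve and Gini coefficient mentioned in this abstract can be illustrated with a minimal sketch. It assumes a plain list of settlement populations and uses the trapezoid rule for the area under the Lorenz curve; the function names and the toy inputs are hypothetical and do not reproduce the paper's data or exact procedure.

```python
def lorenz_curve(populations):
    """Return Lorenz curve points (share of settlements, share of population)."""
    values = sorted(populations)
    total = sum(values)
    n = len(values)
    points = [(0.0, 0.0)]
    cumulative = 0.0
    for i, value in enumerate(values, start=1):
        cumulative += value
        points.append((i / n, cumulative / total))
    return points


def gini(populations):
    """Gini coefficient as 1 minus twice the area under the Lorenz curve."""
    points = lorenz_curve(populations)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between adjacent points
    return 1.0 - 2.0 * area


# Toy inputs (hypothetical): equal-sized settlements vs. one dominant city.
print(gini([50_000, 50_000, 50_000, 50_000]))
print(gini([10_000, 20_000, 30_000, 2_000_000]))
```

A value near 0 means population is spread evenly across settlements, and a value closer to 1 means it is concentrated in a few large ones, so a rise in the coefficient over time corresponds to the shift towards larger cities described above.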

Accounting for Training Data Error in Machine Learning Applied to Earth Observations
