2015
DOI: 10.3390/rs70708489

On the Importance of Training Data Sample Selection in Random Forest Image Classification: A Case Study in Peatland Ecosystem Mapping

Abstract: Random Forest (RF) is a widely used algorithm for classification of remotely sensed data. Through a case study in peatland classification using LiDAR derivatives, we present an analysis of the effects of input data characteristics on RF classifications (including RF out-of-bag error, independent classification accuracy and class proportion error). Training data selection and specific input variables (i.e., image channels) have a large impact on the overall accuracy of the image classification. High-dimension da…

citations
Cited by 467 publications
(399 citation statements)
references
References 35 publications
“…However, the training data should be distributed evenly in geographic space to avoid generating spurious classification accuracies (Friedl, Brodley, and Strahler 1999), which is not possible in this study as the MODIS active fire detections are sparsely distributed. In addition, unlike supervised land cover classification approaches, where training sample points inherently have high spatial autocorrelation due to the way they are collected (Egorov et al 2015; Millard and Richardson 2015), the training data in this study were derived from a random subset of the MODIS active fire detections and therefore are less likely to be spatially autocorrelated. As observed in similar regional fire-related random forest studies (Archibald et al 2009; Oliveira et al 2012), spatial autocorrelation of predictor variables may occur due to various physical and biological processes but no technique to incorporate spatial dependence has been reliably demonstrated and this remains an area of active research.…”
Section: Discussion (mentioning)
confidence: 99%
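
One way to reduce the influence of spatially autocorrelated reference points on an accuracy estimate is to hold out whole spatial blocks rather than individual points. The sketch below is only an illustration of that idea and is not a method taken from the cited papers. The function name `spatial_block_split`, the 250-unit block size, the 30% test fraction, and the "bog"/"fen" labels are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def spatial_block_split(points, block_size, test_fraction=0.3, seed=0):
    """Split (x, y, label) samples into train/test sets by whole grid blocks.

    Hypothetical helper: every sample in a grid cell goes to the same side
    of the split, so nearby samples are never divided between train and test.
    """
    blocks = defaultdict(list)
    for x, y, label in points:
        # Grid cell index of the sample (assumes projected coordinates).
        key = (int(x // block_size), int(y // block_size))
        blocks[key].append((x, y, label))

    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [p for k in keys if k not in test_keys for p in blocks[k]]
    test = [p for k in keys if k in test_keys for p in blocks[k]]
    return train, test


# Synthetic reference points in a 1000 x 1000 area (made-up data).
rng = random.Random(42)
samples = [
    (rng.uniform(0, 1000), rng.uniform(0, 1000), rng.choice(["bog", "fen"]))
    for _ in range(200)
]
train, test = spatial_block_split(samples, block_size=250)
print(len(train), len(test))
```

The block size controls how far apart training and test samples are forced to be. A sensible value depends on the range of spatial autocorrelation in the data, which this sketch does not estimate.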
“…OOB error was used as a basis for comparison of classifications to determine optimum input parameters, years and seasons (Table 4) as described below. OOB error has been shown to be optimistic compared to independent sample validation accuracy [75,80], but when applied consistently in the same manner, it can be an efficient way to compare classifications and conduct variable selection. It was preferred over independent validation for this study given: (1) the field generated reference data set sample size was limited due to poor accessibility to all parts of the wetlands; and (2) both the field and image-based reference samples follow the general arcuate shape of the wetlands and were probably spatially auto-correlated.…”
Section: Image Classification (mentioning)
confidence: 99%
“…Although RF is generally considered robust to overfitting [106], it is highly likely that overall classification accuracy levels reported in this study were overly optimistic since the RF accuracy assessment was obtained from the "Out-of-Bag" (OOB) accuracy estimate, which is known to represent inflated accuracy [75,80]. OOB accuracy is useful for comparison of multiple classification models as in this study but independent validation is required to determine absolute accuracy.…”
Section: Limitations and Recommendations For Future Mapping Of Wetlands (mentioning)
confidence: 99%
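
For readers unfamiliar with the out-of-bag (OOB) estimate: each tree in a random forest is fit to a bootstrap sample drawn with replacement from the training set, and the samples that were not drawn for that tree are its out-of-bag samples. The sketch below uses only the Python standard library to show how large that out-of-bag set tends to be. It does not train a classifier, and the sample count, tree count, and function name `out_of_bag_indices` are assumptions made for illustration.

```python
import math
import random


def out_of_bag_indices(n_samples, rng):
    """Return the indices left out of one bootstrap draw of size n_samples."""
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return [i for i in range(n_samples) if i not in in_bag]


rng = random.Random(0)
n_samples = 1000  # hypothetical number of training samples
n_trees = 200     # hypothetical number of trees (one bootstrap draw each)

fractions = [
    len(out_of_bag_indices(n_samples, rng)) / n_samples
    for _ in range(n_trees)
]
mean_oob_fraction = sum(fractions) / n_trees

# Probability that a given sample is missed by all n_samples draws;
# this approaches 1/e as n_samples grows.
expected = (1 - 1 / n_samples) ** n_samples

print(f"mean OOB fraction: {mean_oob_fraction:.3f}")
print(f"expected:          {expected:.3f} (1/e = {math.exp(-1):.3f})")
```

Because the out-of-bag samples come from the same reference set as the in-bag samples, they plausibly share its spatial layout. That is one possible reading of why the statements above treat OOB accuracy as a basis for comparing models rather than as a substitute for independent validation.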
“…intensive for computation. Moreover, a machine learning algorithm, random forests (RF), has been widely used for classification of LULC types [21][22][23][24][25][26][27]. This method has the ability of optimizing both classification results and selection of remote sensing variables.…”
Section: Introduction (mentioning)
confidence: 99%