2018
DOI: 10.3847/1538-3881/aaf101

Probabilistic Random Forest: A Machine Learning Algorithm for Noisy Data Sets

Abstract: Machine learning (ML) algorithms are becoming increasingly important in the analysis of astronomical data. However, since most ML algorithms are not designed to take data uncertainties into account, ML-based studies are mostly restricted to data with a high signal-to-noise ratio. Astronomical datasets of such high quality are uncommon. In this work we modify the long-established Random Forest (RF) algorithm to take into account uncertainties in the measurements (i.e., features) as well as in the assigned classes (i.e.…
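The abstract is cut off before it says how feature uncertainties enter the forest. As a minimal sketch of one way this could work, assuming each measurement carries an independent Gaussian error, a tree node can send an object to both branches with fractional weights instead of making a hard left/right decision. The function names `prob_left` and `soft_split` are hypothetical and are not taken from the paper or from any library.

```python
import math


def prob_left(x, sigma, threshold):
    """Probability that the true value lies at or below the threshold,
    given a measurement x with Gaussian uncertainty sigma (assumption)."""
    if sigma <= 0:
        # No uncertainty: fall back to the hard split of a standard RF.
        return 1.0 if x <= threshold else 0.0
    z = (threshold - x) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))  # Gaussian CDF evaluated at the threshold


def soft_split(measurements, threshold):
    """For one hypothetical node, return (weight_left, weight_right)
    for every (value, sigma) pair in measurements."""
    weights = []
    for x, sigma in measurements:
        p = prob_left(x, sigma, threshold)
        weights.append((p, 1.0 - p))
    return weights


if __name__ == "__main__":
    # Three objects with the same threshold but different noise levels.
    objects = [(0.2, 0.05), (0.9, 0.50), (1.0, 0.0)]
    for (x, s), (wl, wr) in zip(objects, soft_split(objects, threshold=1.0)):
        print(f"x={x:.2f} sigma={s:.2f} -> left={wl:.3f} right={wr:.3f}")
```

With zero uncertainty the weights reduce to the usual hard split, so a sketch like this behaves as an ordinary decision node on noiseless data.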

citations
Cited by 127 publications
(116 citation statements)
references
References 51 publications
(58 reference statements)
“…Interestingly however, we find that our networks are robust to contaminated labels; only minor degradation in overall performance is experienced up to a contamination fraction of 0.48, after which performance decreases rapidly. This result is consistent with those from other studies (Rolnick et al 2017;Li et al 2019;Reis et al 2019) and suggests that label contamination in real data is of little consequence to overall performance.…”
Section: Discussion (supporting)
confidence: 93%
“…The loss in accuracy up to 45% contamination was ∼4%. Reis et al (2019) perform the same experiment for probabilistic random forests and found a loss of less than 5% when more than 45% of their dataset had incorrect labels, in-line with the performance drop we find. Evidently label contamination does hinder performance, but the network is robust to small contamination fractions.…”
Section: Label Noise (supporting)
confidence: 76%
“…Currently, in our experiment, about 35% are lacking a photo-z for this reason. However, the use of fluxes instead of magnitudes and adding the photometric errors to the list of parameters (using e.g., Reis et al 2019) should solve the issue.…”
Section: Discussion (mentioning)
confidence: 99%
“…The photometric catalogues used in Stripe 82X are relatively shallow, implying that the fainter sources in general have a large photometric error. Unlike SED-fitting, until recently ML techniques could not handle the errors associated to the measurements and the same weight was incorrectly assumed for each photometric value (but see Reis et al 2019, for a counter example application). We tried to assess how this is impacting on the result, by reducing BE ST magcolopt to subsamples of decreasing photometric errors in all the bands.…”
Section: Impact Of Photometric Errors (mentioning)
confidence: 99%
“…To obtain better recovery fractions using spectra directly, better methods to compute the distance between spectra will be necessary (e.g. Reis et al 2019).…”
Section: Comparison Of Chemical Spaces (mentioning)
confidence: 99%