The Random Forest Kernel and other kernels for big data from random partitions
Preprint, 2014
DOI: 10.48550/arxiv.1402.4293

Cited by 19 publications (32 citation statements)
References 0 publications

“…MAPLE fits a regression forest to the outputs of a black-box model, and then uses a feature importance selector called DSTUMP (Kazemitabar et al. 2017) to select the most important features. When an explanation is desired, MAPLE uses SILO (Bloniarz et al. 2016), a local linear modeling technique that uses random forests to identify supervised neighbors (Davies and Ghahramani 2014; He et al. 2014), to generate a prediction. Specifically, given an instance to predict x_t, SILO generates a local training distribution based on how often a training instance x_i ends at the same terminal node as x_t.…”
Section: Sample-based Explanations
confidence: 99%
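The SILO weighting described in this excerpt is straightforward to prototype. Below is a minimal sketch, assuming a fitted scikit-learn RandomForestRegressor; the function name `local_weights` and the exact normalization are my own illustrative choices, not the published SILO implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def local_weights(forest, X_train, x_t):
    """Weight each training instance by how often it shares a terminal
    (leaf) node with the query point x_t across the trees of the forest.

    Illustrative sketch only; normalization may differ from SILO's.
    """
    # apply() returns, for each sample, the leaf index it reaches in each tree.
    train_leaves = forest.apply(X_train)             # (n_train, n_trees)
    query_leaves = forest.apply(x_t.reshape(1, -1))  # (1, n_trees)
    # Per-tree indicator: does training instance i share a leaf with x_t?
    same_leaf = train_leaves == query_leaves         # (n_train, n_trees)
    # Normalize by matching-leaf size so each tree contributes a proper
    # distribution over training instances, then average across trees.
    leaf_sizes = same_leaf.sum(axis=0)               # leaf size per tree
    return (same_leaf / np.maximum(leaf_sizes, 1)).mean(axis=1)
```

The returned weights form the "local training distribution" the excerpt mentions; a weighted local linear model fit with them yields MAPLE's explanation.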
“…Another approach is to use random partitions [5]. Random partitions adopt a different approach in the sense that the method strives to infer the model from the training instances only, without any prior formulation of the measure or any similarity constraints.…”
Section: Learning Dissimilarity Representations
confidence: 99%
“…The key idea of random partitions is to define multiple randomized partitions of the input space in such a way that they form homogeneous groups (clusters) of instances. It has been proven that such random partitions can be used to define kernels, which can be viewed as (dis)similarity measurements [5], [6]. Beyond these mathematical demonstrations, random partitions can be used directly in practice to measure similarities, as with the well-known proximity measure of random forests [4], [9], [10].…”
Section: Learning Dissimilarity Representations
confidence: 99%
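The construction in this excerpt is easy to make concrete. The sketch below, assuming scikit-learn, computes the classical random forest proximity as a kernel: each tree is a random partition of the input space into leaves, and the kernel value for a pair of points is the fraction of trees that place them in the same leaf. The function name is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def random_forest_kernel(forest, X_a, X_b):
    """K[i, j] = fraction of trees in which X_a[i] and X_b[j] share a leaf.

    Each tree induces a partition kernel (1 if same leaf, else 0), which is
    positive semi-definite; an average of PSD kernels is PSD, so K is a
    valid kernel. For very large datasets, accumulate tree by tree instead
    of materializing the 3-D comparison array used here for brevity.
    """
    leaves_a = forest.apply(X_a)   # (n_a, n_trees) leaf indices
    leaves_b = forest.apply(X_b)   # (n_b, n_trees)
    same = leaves_a[:, None, :] == leaves_b[None, :, :]  # (n_a, n_b, n_trees)
    return same.mean(axis=2)
```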
“…Due to the intrinsic tree-building process, random forest estimators can easily handle both univariate and multivariate data with few parameters to tune. Besides, these methods have good predictive power and can outperform standard kernel methods (Davies and Ghahramani, 2014; Scornet, 2016c). Lastly, being based on the random forest algorithm, they are also easily parallelizable and can handle large datasets.…”
Section: Introduction
confidence: 99%
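Because the resulting Gram matrix is a valid kernel, it can be plugged directly into standard kernel methods, which is one way the comparison with standard kernels in this excerpt is made in practice. A hypothetical end-to-end usage, with placeholder data and hyperparameters, reusing the `random_forest_kernel` sketch above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Fit the forest that defines the random partitions.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Precompute the Gram matrix and use it in kernel ridge regression.
K_train = random_forest_kernel(forest, X, X)
model = KernelRidge(alpha=1.0, kernel="precomputed").fit(K_train, y)

# Prediction needs the cross-kernel between new points and training points.
X_new = rng.normal(size=(10, 5))
predictions = model.predict(random_forest_kernel(forest, X_new, X))
```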