Abstract: A gradient boosting decision tree (GBDT), which aggregates a collection of single weak learners (i.e. decision trees), is widely used for data mining tasks. Because GBDT inherits its good performance from its ensemble essence, much attention has been drawn to the optimization of this model. With its popularization, an increasing need for model interpretation arises. Besides the commonly used feature importance as a global interpretation, feature contribution is a local measure that reveals the relationship bet…
“…The proportion of positive samples among all training samples contained in this node is denoted as r_t^k(y), which can also be considered as the probability that a training sample contained in node k belongs to the predicted sample category y. The difference in the proportion of positive samples between a child node and its corresponding parent node can be viewed as the node importance of the child node [40][41][42]. The larger the difference, the higher the purity of the samples split into the child node compared to that of the parent node, and thus the higher the importance of the child node for the classification problem.…”
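The node-importance measure quoted above can be sketched in a few lines. This is an illustrative reading, not code from the cited work: the importance of a child node is taken as the increase in the proportion of positive samples relative to its parent, and all function names are assumptions.

```python
def positive_proportion(labels, positive_class):
    """Fraction of samples in a node that belong to positive_class."""
    if not labels:
        return 0.0
    return sum(1 for y in labels if y == positive_class) / len(labels)

def node_importance(parent_labels, child_labels, positive_class):
    """Difference in positive-sample proportion between child and parent.

    A larger positive difference means the split made the child node purer
    with respect to the predicted class, hence a more important node.
    """
    return (positive_proportion(child_labels, positive_class)
            - positive_proportion(parent_labels, positive_class))

# Example: the parent holds 4 positives out of 8 samples; after the split,
# the left child holds 3 positives out of 4, so its importance is
# 0.75 - 0.5 = 0.25.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left_child = [1, 1, 1, 0]
print(node_importance(parent, left_child, positive_class=1))  # 0.25
```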
Section: Traditional Random Forest Algorithm
Fault detection and diagnosis (FDD) has received considerable attention with the advent of big data. Many data-driven FDD procedures have been proposed, but most of them may not be accurate when data missing occurs. Therefore, this paper proposes an improved random forest (RF) based on decision paths, named DPRF, utilizing correction coefficients to compensate for the influence of incomplete data. In this DPRF model, intact training samples are firstly used to grow all the decision trees in the RF. Then, for each test sample that possibly contains missing values, the decision paths and the corresponding nodes importance scores are obtained, so that for each tree in the RF, the reliability score for the sample can be inferred. Thus, the prediction results of each decision tree for the sample will be assigned to certain reliability scores. The final prediction result is obtained according to the majority voting law, combining both the predicting results and the corresponding reliability scores. To prove the feasibility and effectiveness of the proposed method, the Tennessee Eastman (TE) process is tested. Compared with other FDD methods, the proposed DPRF model shows better performance on incomplete data.
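The reliability-weighted voting described in the abstract above can be sketched as follows. This is a minimal illustration of the idea, not the DPRF algorithm itself: how the per-tree reliability scores are derived from decision paths is not reproduced here, and all names and values are assumptions.

```python
from collections import defaultdict

def weighted_majority_vote(predictions, reliabilities):
    """Combine per-tree predictions with per-tree reliability scores.

    Each tree's class vote is weighted by its reliability score; the
    class with the largest total weight wins.
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, reliabilities):
        totals[label] += weight
    return max(totals, key=totals.get)

# Three trees vote "fault" and two vote "normal", but the "normal" trees
# traversed decision paths unaffected by the sample's missing values, so
# their (hypothetical) reliability scores are higher and they prevail.
preds = ["fault", "fault", "fault", "normal", "normal"]
rel = [0.2, 0.2, 0.2, 0.9, 0.9]
print(weighted_majority_vote(preds, rel))  # normal
```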
“…GBDT is a common choice in machine learning tasks. Besides high performance and efficiency, GBDT and its variants also provide model interpretability [47] and ease of parameter tuning. The most direct transfer is to first train a model on the source dataset.…”
Secure online transaction is an essential task for e-commerce platforms. Alipay, one of the world's leading cashless payment platforms, provides payment services to both merchants and individual customers. Fraud detection models are built to protect the customers, but stronger demands are raised by new scenes, which lack training data and labels. The proposed model makes a difference by utilizing the data from similar old scenes, while the data under a new scene is treated as the target domain to be promoted. Inspired by this real case in Alipay, we view the problem as a transfer learning problem and design a set of revision strategies to transfer the source domain models to the target domain under the framework of gradient boosting tree models. This work provides an option for the cold-start and data-sharing problems.
“…The interpretability of the boosted tree model at both the global and the local level has been shown in [3]. In our work, since the whole model of each task consists of a common part and a specific part, we collect them all to get the overall importance of each feature.…”
Section: Interpretability
“…In our work, since the whole model of each task consists of a common part and a specific part, we collect them all to get the overall importance of each feature. For each instance, the contribution of each feature to the final prediction can be calculated with the method in [3]. An example of the top 20 important features in task2 of Scene1 is shown in figure 2.…”
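The per-instance feature contributions mentioned above can be illustrated with a small sketch in the spirit of decision-path attribution as in [3]: the root's mean prediction serves as a bias term, and each split feature is credited with the change in the node's value between parent and child along the instance's path. The tree path here is hand-built and purely hypothetical.

```python
def path_contributions(path):
    """Decompose a prediction along a decision path.

    path: list of (split_feature, node_value) pairs from root to leaf;
    the leaf's split_feature is None. The root's value is the bias, and
    each step credits the parent's split feature with the change in the
    node value it caused.
    """
    bias = path[0][1]
    contributions = {}
    for (feature, parent_value), (_, child_value) in zip(path, path[1:]):
        contributions[feature] = contributions.get(feature, 0.0) \
            + (child_value - parent_value)
    return bias, contributions

# Root mean 0.5; splitting on "x1" reaches a node with mean 0.75;
# splitting on "x2" reaches a leaf with value 0.625. The bias plus the
# contributions reconstructs the leaf prediction exactly.
bias, contrib = path_contributions([("x1", 0.5), ("x2", 0.75), (None, 0.625)])
print(bias, contrib)  # 0.5 {'x1': 0.25, 'x2': -0.125}
```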
Section: Interpretability
“…(2) The construction of the trees in the common model may be unhelpful or even harmful for some tasks after some rounds, which means that a mechanism is needed to find the proper round at which a task can quit the common model training process if necessary. (3) The training of the second stage should take the information of the first stage into consideration, so that the obtained model can be more effective, instead of simply combining two boosted tree models when predicting for each task. To handle these problems, a regularization strategy is proposed for the construction of each tree to alleviate the domination problem, and an early-stopping strategy is designed so that a task can quit the common process if further training will not improve its performance.…”
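The early-stopping idea quoted above can be sketched with a simple patience rule: a task quits the common training rounds once its validation loss has not improved for a fixed number of consecutive rounds. This is an illustrative stand-in, not the paper's strategy; the function name and the `patience` parameter are assumptions.

```python
def quit_round(val_losses, patience=2):
    """Return the round index at which the task should quit the common
    training process, or None if it should keep training.

    A task quits once `patience` consecutive rounds have passed without
    improving on its best validation loss so far.
    """
    best, since_best = float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return i
    return None

# The loss improves for two rounds, then stalls for two rounds in a row,
# so the task quits at round index 3.
print(quit_round([0.9, 0.8, 0.81, 0.82, 0.79]))  # 3
```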
Section: Multi-task Boosted Tree, 3.1 The Whole Framework
Multi-task learning (MTL) aims at improving the generalization performance of several related tasks by leveraging useful information contained in them. However, in industrial scenarios, interpretability is always demanded, and the data of different tasks may be in heterogeneous domains, making the existing methods unsuitable or unsatisfactory. In this paper, following the philosophy of the boosted tree, we propose a two-stage method. In stage one, a common model is built to learn the commonalities using the common features of all instances. Different from the training of a conventional boosted tree model, we propose a regularization strategy and an early-stopping mechanism to optimize the multi-task learning process. In stage two, starting by fitting the residual error of the common model, a specific model is constructed with the task-specific instances to further boost the performance. Experiments on both benchmark and real-world datasets validate the effectiveness of the proposed method. What's more, interpretability can be naturally obtained from the tree-based method, satisfying the industrial needs.
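The two-stage structure described in the abstract above can be illustrated with a deliberately tiny sketch: a common model is fit on the pooled data of all tasks, and each task then fits a specific model to the residuals of the common model. Plain mean predictors stand in for boosted trees here; all data and names are illustrative, not the paper's algorithm.

```python
def fit_mean(values):
    """A trivial stand-in for a boosted tree: predict the mean target."""
    return sum(values) / len(values)

# Stage one: the common model is learned on the pooled targets of both
# tasks, capturing what they share.
task_a = [0.5, 1.0, 1.5]   # task A targets (mean 1.0)
task_b = [2.5, 3.0, 3.5]   # task B targets (mean 3.0)
common = fit_mean(task_a + task_b)                   # 2.0

# Stage two: each task fits its own model to the residual error left by
# the common model, recovering its task-specific offset.
specific_a = fit_mean([y - common for y in task_a])  # -1.0
specific_b = fit_mean([y - common for y in task_b])  # 1.0

# The final prediction for each task is common part + specific part.
print(common + specific_a, common + specific_b)  # 1.0 3.0
```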