Explaining complex or seemingly simple machine learning models is an important practical problem. We want to explain individual predictions from a complex machine learning model by learning simple, interpretable explanations. Shapley values is a game theoretic concept that can be used for this purpose. The Shapley value framework has a series of desirable theoretical properties, and can in principle handle any predictive model. Kernel SHAP is a computationally efficient approximation to Shapley values in higher dimensions. Like several other existing methods, this approach assumes that the features are independent, which may give very wrong explanations. This is the case even if a simple linear model is used for predictions. In this paper, we extend the Kernel SHAP method to handle dependent features. We provide several examples of linear and non-linear models with various degrees of feature dependence, where our method gives more accurate approximations to the true Shapley values. We also propose a method for aggregating individual Shapley values, such that the prediction can be explained by groups of dependent variables.
Purpose The purpose of this paper is to develop, describe and validate a machine learning model for prioritising which financial transactions should be manually investigated for potential money laundering. The model is applied to a large data set from Norway’s largest bank, DNB. Design/methodology/approach A supervised machine learning model is trained by using three types of historic data: “normal” legal transactions; those flagged as suspicious by the bank’s internal alert system; and potential money laundering cases reported to the authorities. The model is trained to predict the probability that a new transaction should be reported, using information such as background information about the sender/receiver, their earlier behaviour and their transaction history. Findings The paper demonstrates that the common approach of not using non-reported alerts (i.e. transactions that are investigated but not reported) in the training of the model can lead to sub-optimal results. The same applies to the use of normal (un-investigated) transactions. Our developed method outperforms the bank’s current approach in terms of a fair measure of performance. Originality/value This research study is one of very few published anti-money laundering (AML) models for suspicious transactions that have been applied to a realistically sized data set. The paper also presents a new performance measure specifically tailored to compare the proposed method to the bank’s existing AML system.
Should one rely on a parametric or nonparametric model when analysing a given data set? This classic question cannot be answered by traditional model selection criteria like AIC and BIC, since a nonparametric model has no likelihood. The purpose of the present paper is to develop a focused information criterion (FIC) for comparing general non-nested parametric models with a nonparametric alternative. It relies in part on the notion of a focus parameter, a population quantity of particular interest in the statistical analysis. The FIC compares and ranks candidate models based on estimated precision of the different model-based estimators for the focus parameter. It has earlier been developed for several classes of problems, but mainly involving parametric models. The new FIC, including also nonparametrics, is novel also in the mathematical context, being derived without the local neighbourhood asymptotics underlying previous versions of FIC. Certain average-weighted versions, called AFIC, allowing several focus parameters to be considered simultaneously, are also developed. We concentrate on the standard i.i.d. setting and certain direct extensions thereof, but also sketch further generalisations to other types of data. Theoretical and simulation-based results demonstrate desirable properties and satisfactory performance.
Cox's proportional hazards regression model is the standard method for modelling censored life-time data with covariates. In its standard form, this method relies on a semiparametric proportional hazards structure, leaving the baseline unspecified. Naturally, specifying a parametric model also for the baseline hazard, leading to fully parametric Cox models, will be more efficient when the parametric model is correct, or close to correct. The aim of this paper is two-fold. (a) We compare parametric and semiparametric models in terms of their asymptotic relative efficiencies when estimating different quantities. We find that for some quantities the gain of restricting the model space is substantial, while it is negligible for others. (b) To deal with such selection in practice we develop certain focused and averaged focused information criteria (FIC and AFIC). These aim at selecting the most appropriate proportional hazards models for given purposes. Our methodology applies also to the simpler case without covariates, when comparing Kaplan-Meier and Nelson-Aalen estimators to parametric counterparts. Applications to real data are also provided, along with analyses of theoretical behavioural aspects of our methods.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.