Like a hydra, fraudsters adapt and circumvent increasingly sophisticated barriers erected by public or private institutions. Among these institutions, banks must quickly take measures to avoid losses while guaranteeing the satisfaction of law-abiding customers. Facing an expanding flow of operations, effective banking relies on data analytics to support established risk control processes, but also on a better understanding of the underlying fraud mechanism. In addition, fraud being a criminal offence, the evidential aspect of the process must also be considered. These legal, operational, and strategic constraints lead to compromises on the means to be implemented for fraud management. This paper first focuses on the translation of practical questions raised in the banking industry at each step of the fraud management process into the performance evaluation required to design a fraud detection model. Secondly, it considers a range of machine learning approaches that address these specificities: the imbalance between fraudulent and nonfraudulent operations, the lack of fully trusted labels, the concept-drift phenomenon, and the unavoidable trade-off between accuracy and interpretability of detection. This state-of-the-art review sheds some light on a technology race between black box machine learning models improved by post-hoc interpretation and intrinsically interpretable models boosted to gain accuracy. Finally, it discusses how concrete and promising hybrid approaches can provide pragmatic, short-term answers to banks and policy makers without swallowing up stakeholders with economic and ethical stakes in this technological race.
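To make the imbalance issue mentioned above concrete, here is a minimal sketch of why plain accuracy is a poor performance measure when fraud is rare. The data are hypothetical toy numbers (1,000 operations, 10 of them fraudulent), and the helper name `metrics` is an assumption made for illustration; nothing here is taken from the paper itself.

```python
def metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = fraud, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # frauds correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate operations flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # frauds missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate operations passed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical toy data: 10 frauds followed by 990 legitimate operations.
y_true = [1] * 10 + [0] * 990

# Baseline that never raises an alert.
always_legit = [0] * 1000

# Hypothetical detector: catches 8 of the 10 frauds, wrongly flags 20 legitimate operations.
detector = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

baseline_scores = metrics(y_true, always_legit)
detector_scores = metrics(y_true, detector)

print("baseline:", baseline_scores)
print("detector:", detector_scores)
```

In this toy setting the do-nothing baseline has the higher accuracy while detecting no fraud at all, which is one reason evaluation under imbalance usually looks at precision and recall (or related measures) rather than accuracy alone.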
The exploration of health data by clustering algorithms makes it possible to better describe populations of interest by identifying the sub-profiles that compose them. This in turn reinforces medical knowledge, whether about a disease or about a targeted population in real life. Nevertheless, in contrast to conventional biostatistical methods, for which numerous guidelines exist, the standardization of data science approaches in clinical research remains a little-discussed subject. This results in significant variability in the execution of data science projects, whether in terms of the algorithms used or the reliability and credibility of the designed approach. Taking the path of a parsimonious and judicious choice of both algorithms and implementations at each stage, this article proposes Qluster, a practical workflow for performing clustering tasks. This workflow strikes a compromise between (1) genericity of applications (e.g. usable on small or big data, on continuous, categorical or mixed variables, on databases of high dimensionality or not), (2) ease of implementation (need for few packages, few algorithms, few parameters, ...), and (3) robustness (e.g. use of proven algorithms and robust packages, evaluation of the stability of clusters, management of noise and multicollinearity). The workflow can be easily automated and/or routinely applied on a wide range of clustering projects. It can be useful both for data scientists with little experience in the field, to make data clustering easier and more robust, and for more experienced data scientists who are looking for a straightforward and reliable solution to routinely perform preliminary data mining. A synthesis of the literature on data clustering as well as the scientific rationale supporting the proposed workflow is also provided. Finally, a detailed application of the workflow on a concrete use case is provided, along with a practical discussion for data scientists.
An implementation on the Dataiku platform is available upon request to the authors.
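The abstract above mentions evaluating the stability of clusters without saying how. One common approach, sketched here purely as an assumption and not as the Qluster implementation, is to recluster a resample of the data and measure, for each original cluster, the best Jaccard similarity with any cluster found on the resample. The function names and the toy partitions below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of point identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability_scores(reference, resampled, sampled_ids):
    """For each reference cluster, return its best Jaccard match among the
    clusters obtained on a resample.

    reference   : list of sets, clusters found on the full data
    resampled   : list of sets, clusters found on the resampled data
    sampled_ids : identifiers of the points present in the resample
    """
    sampled = set(sampled_ids)
    scores = []
    for ref in reference:
        # Only points that were actually drawn in the resample can be matched.
        ref_in_sample = ref & sampled
        best = max((jaccard(ref_in_sample, c) for c in resampled), default=0.0)
        scores.append(best)
    return scores


# Hypothetical toy example: 8 points, two reference clusters.
reference = [{0, 1, 2, 3}, {4, 5, 6, 7}]

# Hypothetical resample that leaves out point 3; reclustering it keeps the
# first cluster intact but splits the second one in two.
sampled_ids = [0, 1, 2, 4, 5, 6, 7]
resampled = [{0, 1, 2}, {4, 5}, {6, 7}]

scores = stability_scores(reference, resampled, sampled_ids)
print(scores)  # [1.0, 0.5]
```

In practice such scores would be averaged over many resamples, with values close to 1 suggesting a cluster that is recovered consistently and low values suggesting a cluster that dissolves when the data change slightly.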
The exploration of health data by clustering algorithms makes it possible to better describe populations of interest by identifying the sub-profiles that compose them. This in turn reinforces medical knowledge, whether about a disease or about a targeted population in real life. Nevertheless, in contrast to conventional biostatistical methods, for which numerous guidelines exist, the standardization of data science approaches in clinical research remains a little-discussed subject. This results in significant diversity in the execution of data science projects, whether in terms of the algorithms used or the reliability and credibility of the designed approach. Taking the path of a parsimonious and judicious choice of both algorithms and implementations at each stage, this paper proposes Qluster, a practical workflow for performing clustering tasks. This workflow strikes a compromise between (1) genericity, as it is suitable regardless of the data volume (small/big) and regardless of the nature of the variables (continuous/qualitative/mixed), (2) ease of implementation, as it is based on a few easy-to-use software packages, and (3) robustness, through the stability evaluation of the final clusters and through recognized algorithms and implementations. The workflow can be easily automated and/or routinely applied on a wide range of clustering projects. A synthesis of the literature on data clustering as well as the scientific rationale supporting the proposed workflow is also provided. Finally, a detailed application of the workflow on a concrete use case is provided, along with a practical discussion for data scientists. An implementation on the Dataiku platform is available upon request to the authors.