A novel customer churn prediction model for the telecommunication industry using data transformation methods and feature selection

Sana, Joydeb Kumar; Abedin, Mohammad Zoynul; Rahman, Mohammad Shahriar; Rahman, Mohammad Saifur

doi:10.1371/journal.pone.0278095

Cited by 13 publications

(7 citation statements)

References 34 publications

“…When training over the data which contains the class imbalance problem, the ML models will overclassify the majority class. Mostly, the classifiers focus on majority class rather than misclassifying or ignoring the minority class ( Kaur, Pannu & Malhi, 2020 ; Sana et al, 2022 ; Saha et al, 2023 ). Therefore, for acquiring better and accurate results, we need to handle the class imbalance problem.…”

Section: Methods (mentioning)

confidence: 99%

An autonomous mixed data oversampling method for AIOT-based churn recognition and personalized recommendations using behavioral segmentation

Fatima, Khan, Aadil et al. 2024

PeerJ Computer Science

The telecom sector is currently undergoing a digital transformation by integrating artificial intelligence (AI) and Internet of Things (IoT) technologies. Customer retention in this context relies on the application of autonomous AI methods for analyzing IoT device data patterns in relation to the offered service packages. One significant challenge in existing studies is treating churn recognition and customer segmentation as separate tasks, which diminishes overall system accuracy. This study introduces an innovative approach by leveraging a unified customer analytics platform that treats churn recognition and segmentation as a bi-level optimization problem. The proposed framework includes an Auto Machine Learning (AutoML) oversampling method, effectively handling three mixed datasets of customer churn features while addressing imbalanced-class distribution issues. To enhance performance, the study utilizes the strength of oversampling methods like synthetic minority oversampling technique for nominal and continuous features (SMOTE-NC) and synthetic minority oversampling with encoded nominal and continuous features (SMOTE-ENC). Performance evaluation, using 10-fold cross-validation, measures accuracy and F1-score. Simulation results demonstrate that the proposed strategy, particularly Random Forest (RF) with SMOTE-NC, outperforms standard methods with SMOTE. It achieves accuracy rates of 79.24%, 94.54%, and 69.57%, and F1-scores of 65.25%, 81.87%, and 45.62% for the IBM, Kaggle Telco and Cell2Cell datasets, respectively. The proposed method autonomously determines the number and density of clusters. Factor analysis employing Bayesian logistic regression identifies influential factors for accurate customer segmentation. Furthermore, the study segments consumers behaviorally and generates targeted recommendations for personalized service packages, benefiting decision-makers.
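
The class-imbalance concern raised in the citation statement above can be illustrated with a minimal Python sketch. The 90/10 label split and the always-predict-majority baseline are assumptions chosen for illustration only; they are not taken from the IBM, Kaggle Telco, or Cell2Cell datasets, and the baseline is not one of the cited models.

```python
from collections import Counter

# Hypothetical labels: 0 = customer stays (majority), 1 = customer churns (minority).
y_true = [0] * 90 + [1] * 10

# A baseline that always predicts the majority class and ignores the minority class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

# Accuracy looks strong because most samples belong to the majority class.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class shows that no churner is actually detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

Under these assumed labels the baseline reaches 90% accuracy while identifying none of the churners, which is the failure mode that oversampling methods such as SMOTE are meant to mitigate.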

“…However, in the case of class imbalance, the accuracy may be affected by the uneven distribution of categories. The F1 score, denoting the harmonic mean of precision and recall [39], is a comprehensive evaluation metric, especially well-suited for dealing with unbalanced datasets. A superior F1 score generally suggests that the model maintains a more effective equilibrium between recall and precision [35].…”

Section: Evaluation Measures (mentioning)

confidence: 99%

“…A superior F1 score generally suggests that the model maintains a more effective equilibrium between recall and precision [35]. A perfect model has an F1 score of 1 [39].…”

Section: Evaluation Measures (mentioning)

confidence: 99%
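
The description of the F1 score as the harmonic mean of precision and recall, quoted above, can be made concrete with a short Python sketch. The function name `f1_score` and the confusion-matrix counts are hypothetical and chosen only for illustration; returning 0.0 when there are no true positives is a common convention assumed here, not something stated in the quoted text.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    tp, fp, fn are true-positive, false-positive, and false-negative counts.
    """
    if tp == 0:
        # Assumed convention: with no true positives, F1 is taken to be 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for three classifiers evaluated on 10 actual churners.
balanced = f1_score(tp=8, fp=2, fn=2)   # precision 0.8, recall 0.8
skewed = f1_score(tp=2, fp=0, fn=8)     # precision 1.0, recall 0.2
perfect = f1_score(tp=10, fp=0, fn=0)   # precision 1.0, recall 1.0

print(f"balanced={balanced:.3f}, skewed={skewed:.3f}, perfect={perfect:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the skewed classifier scores about 0.333 despite perfect precision, while the perfect classifier scores 1.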

The causes of bank customer churn based on XGBoost and LightGBM models: the evidence from the Kaggle dataset

Cai 2024

As the digital economy continues to grow, the expansion of Internet finance introduces new challenges for the conventional banking industry. Banks must deal with multiple pressures, such as digital transformation, declining customer loyalty, and fintech competition. Analyzing the potential factors of bank customer churn from multiple perspectives and constructing models for predicting churn can help bank managers understand the causes of churn, identify problems, detect potential churn customers promptly, and develop efficient retention strategies based on customer characteristics and preferences. In this paper, we used a combination of visualization, data mining, and machine learning methods to analyze the factors used to predict bank customer churn from multiple perspectives, such as feature selection (Random Forest Feature Importance Ranking), feature extraction (PCA), visualization, etc. We also constructed two churn prediction models based on the gradient boosting tree algorithms, XGBoost and LightGBM, compared the evaluation measures before and after feature selection and before and after tuning parameters, and interpreted the model through SHAP methods. After the paper, the following conclusions were drawn: (1) Total Trans Amt, Total Trans Ct, and Total Revolving Bal are pivotal in analyzing and predicting customer churn; (2) the SHAP Summary Plot can react to the visual analysis of predictors of customer churn to a certain extent; (3) the effect of feature selection on the assessment of the results is sometimes insignificant; (4) tuning parameter settings can enhance model performance to a certain extent, but the optimal parameters may vary based on the preprocessing method employed. These conclusions will assist banks in comprehending customer churn factors more deeply, constructing a higher performance churn prediction model, and conducting a comprehensive result synthesis analysis.

“…In literature [10], a credit default prediction model was developed using GBDT and the K-means SMOTE oversampling method was used to address the imbalance in the data set, while the original hypothesis was rejected with a p-value < 0.001 using one-way analysis of variance, confirming the statistical significance of the improved performance of the proposed model. Literature [11] uses univariate techniques for feature selection in the customer churn domain and uses a grid search approach to select the optimal hyperparameters for the optimal model GDBT, demonstrating the benefits of applying data transformation methods and feature selection when training an optimized CCP model. Literature [12] proposes a default prediction model based on decision tree model using XGBoost model in integrated learning for accurate prediction of customer default in P2P lending, and also applies feature ranking based on learning model to P2P lending credit data with hyperparameter optimization for individual classifiers.…”

Section: Customer Churn Prediction (mentioning)

confidence: 99%

“…Although the above studies have contributed to customer churn prediction, most of the current studies on customer churn prediction have used ensemble learning methods to construct customer churn models, for example, in the literature [10][11][12][13][14][15][16], ensemble learning has been used to construct the corresponding models. The ensemble learning approach, as a black box model with high complexity, cannot justify the prediction results of the models used.…”

Section: Customer Churn Prediction (mentioning)

confidence: 99%

Research on customer churn prediction and model interpretability analysis

Peng, 2023

PLoS ONE

In recent years, with the continuous improvement of the financial system and the rapid development of the banking industry, the competition of the banking industry itself has intensified. At the same time, with the rapid development of information technology and Internet technology, customers’ choice of financial products is becoming more and more diversified, and customers’ dependence and loyalty to banking institutions is becoming less and less, and the problem of customer churn in commercial banks is becoming more and more prominent. How to predict customer behavior and retain existing customers has become a major challenge for banks to solve. Therefore, this study takes a bank’s business data on Kaggle platform as the research object, uses multiple sampling methods to compare the data for balancing, constructs a bank customer churn prediction model for churn identification by GA-XGBoost, and conducts interpretability analysis on the GA-XGBoost model to provide decision support and suggestions for the banking industry to prevent customer churn. The results show that: (1) The applied SMOTEENN is more effective than SMOTE and ADASYN in dealing with the imbalance of banking data. (2) The F1 and AUC values of the model improved and optimized by XGBoost using genetic algorithm can reach 90% and 99%, respectively, which are optimal compared to other six machine learning models. The GA-XGBoost classifier was identified as the best solution for the customer churn problem. (3) Using Shapley values, we explain how each feature affects the model results, and analyze the features that have a high impact on the model prediction, such as the total number of transactions in the past year, the amount of transactions in the past year, the number of products owned by customers, and the total sales balance. 
The contribution of this paper is mainly in two aspects: (1) this study can provide useful information from the black box model based on the accurate identification of churned customers, which can provide reference for commercial banks to improve their service quality and retain customers; (2) it can provide reference for customer churn early warning models of other related industries, which can help the banking industry to maintain customer stability, maintain market position and reduce corporate losses.
