Customer retention is a major issue for service-based organizations, particularly in the telecom industry, where predictive models of customer behavior are among the most useful instruments for retaining customers and inferring their future behavior. However, the performance of predictive models degrades considerably when the real-world data set is highly imbalanced. A data set is called imbalanced if the sample size of one class is much smaller or much larger than that of the other classes. The most commonly used technique for handling the class-imbalance problem (CIP) in various domains is oversampling or undersampling. In this paper, we survey six well-known sampling techniques and compare their performance: mega-trend diffusion function (MTDF), synthetic minority oversampling technique, adaptive synthetic sampling approach, couples top-N reverse k-nearest neighbor, majority weighted minority oversampling technique, and immune centroids oversampling technique. Moreover, this paper evaluates four rules-generation algorithms (the learning from example module, version 2 (LEM2), covering, exhaustive, and genetic algorithms) using publicly available data sets. The empirical results demonstrate that MTDF and rules generation based on genetic algorithms achieved the best overall predictive performance among the evaluated oversampling methods and rules-generation algorithms.
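As a rough illustration of the imbalance definition and of oversampling by interpolation, the sketch below computes a majority-to-minority ratio and generates synthetic minority points in the general style of the synthetic minority oversampling technique. It is a simplified sketch, not the implementation evaluated in the paper; the function names, the toy data, and the choice of `k` are assumptions made here for illustration only.

```python
import math
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen sample, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: 8 non-churners (label 0) and 3 churners (label 1).
labels = [0] * 8 + [1] * 3
churners = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2)]

new_points = smote_like_oversample(churners, n_new=5)
balanced_labels = labels + [1] * len(new_points)

print("ratio before:", imbalance_ratio(labels))
print("ratio after:", imbalance_ratio(balanced_labels))
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies; the other surveyed techniques differ mainly in how they choose where to place new samples.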
With the rapid growth of digital data and associated technologies, industries are becoming rapidly digitized. These technologies provide great opportunities to identify and resolve different problems. In particular, the telecommunication industry faces a serious problem of customer churn, i.e., customers who are going to abandon their established relationship with the business/network in the near future. This problem can affect not only the growth of the business but also its revenues. Therefore, many customer churn prediction (CCP) models have been introduced, but they have not yielded the desired performance. This is because there can be many factors contributing to customer churn that are still unexplored. In this paper, we focus on determining the effectiveness of the factors considered by the proposed CCP model, i.e., the lower and upper distance between the samples. Further, we demonstrate a novel solution for the telecommunication sector that shows the hidden factors considered in predicting customer churn. Finally, we investigate the effects of both types of samples: those at a lower distance and those at an upper distance (in terms of relevance) from the majority samples in a given publicly available dataset. We found that lower distance test set (LDT) samples obtained the best performance compared with upper distance test set (UDT) samples, in terms of increased accuracy, F-measure, precision, and recall as the uncertain sample size increases. This is because the classification performance on upper distance samples remains almost the same when the number of samples in the test set increases.