2020
DOI: 10.3390/app10041276
Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem

Abstract: The class imbalance problem has been a hot topic in the machine learning community in recent years. Nowadays, in the era of big data and deep learning, the problem remains in force. Much work has been performed to deal with the class imbalance problem, with the random sampling methods (over- and under-sampling) being the most widely employed approaches. Moreover, more sophisticated sampling methods have been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and they have also been combined with…
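The random over- and under-sampling baselines the abstract describes can be sketched in a few lines of NumPy. This is a minimal illustration of the general idea, not the paper's implementation; the function name `random_resample` and its parameters are our own.

```python
import numpy as np

def random_resample(X, y, mode="over", rng=0):
    """Random sampling baseline: 'over' duplicates samples of smaller
    classes up to the majority count; 'under' discards samples of larger
    classes down to the minority count."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max() if mode == "over" else counts.min()
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        # sample with replacement only when we need more copies than exist
        idx.append(rng.choice(members, size=target, replace=(target > n)))
    idx = np.concatenate(idx)
    return X[idx], y[idx]
```

Over-sampling risks duplicating noise; under-sampling discards information, which is why the abstract notes that more sophisticated methods such as SMOTE were developed.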

Cited by 79 publications (47 citation statements)
References 64 publications
“…Imbalance data remains a key challenge against classification models [15,18]. The majority of literature considered re-sampling approaches, i.e., both over-sampling and under-sampling, to alleviate degradation due to the issue of imbalanced data [1,17,19,33,37].…”
Section: Theoretical Background
confidence: 99%
“…The low volume of the potential target/important customer data (i.e., imbalanced data distribution) is a major challenge in extracting the latent knowledge in banks marketing data [1,3,10]. There is still an insisting need for handling the imbalanced dataset distribution reliably [15][16][17]; commonly used approaches [1,15,16,[18][19][20][21] impose processing overhead or lead to loss of information.…”
Section: Introduction
confidence: 99%
“…We used two established approaches, namely the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbors (ENN), to balance the IoT-23, LITNET-2020, and NetML-2020 datasets [ 22 , 23 ]. Recently, hybrid approaches have become popular [ 24 ]. Methods like SMOTE+ENN, among other, have often been utilized for alleviating the issue of class imbalance to boost the efficiency of the classifier.…”
Section: Proposed Methodology
confidence: 99%
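The SMOTE interpolation idea this citation refers to can be sketched as follows. This is a hedged NumPy illustration of the technique, not the cited papers' implementation; the function name `smote` and its defaults are assumptions.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=0):
    """Sketch of SMOTE: each synthetic sample is a random point on the
    segment between a minority sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(rng)
    k = min(k, len(X_min) - 1)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours per sample
    base = rng.integers(0, len(X_min), n_new)    # random base samples
    neigh = nn[base, rng.integers(0, k, n_new)]  # one random neighbour each
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```

Because every synthetic point lies between two existing minority samples, SMOTE stays inside the minority region rather than duplicating exact copies, which is what distinguishes it from random over-sampling.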
“…The count of samples in each class in the pre-processed dataset is subjected to the balancing procedure. Following [ 24 ], the approach for class balancing is presented in Algorithm 1: SMOTE+ENN.…”
Section: Proposed Methodology
confidence: 99%
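The Algorithm 1 the citation mentions is not reproduced on this page. A minimal self-contained sketch of the general SMOTE+ENN idea (first oversample minority classes by interpolation, then clean the result with Edited Nearest Neighbours) might look like the following; the helper names, defaults, and structure are our own assumptions, not the cited algorithm.

```python
import numpy as np

def _knn(A, B, k):
    """Indices of the k nearest rows of B for each row of A (Euclidean)."""
    d = np.linalg.norm(A[:, None] - B[None, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def smote_enn(X, y, k=3, rng=0):
    """Sketch of SMOTE+ENN: oversample each minority class by SMOTE
    interpolation, then drop samples whose label disagrees with the
    majority label of their k nearest neighbours (ENN editing)."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    Xs, ys = [X], [y]
    # SMOTE step: synthesise points along segments between minority neighbours
    for c, n in zip(classes, counts):
        if n == target or n < 2:
            continue
        Xc = X[y == c]
        kk = min(k, n - 1)
        nn = _knn(Xc, Xc, kk + 1)[:, 1:]          # skip the self-neighbour
        base = rng.integers(0, n, target - n)
        neigh = nn[base, rng.integers(0, kk, target - n)]
        gap = rng.random((target - n, 1))
        Xs.append(Xc[base] + gap * (Xc[neigh] - Xc[base]))
        ys.append(np.full(target - n, c))
    X2, y2 = np.vstack(Xs), np.concatenate(ys)
    # ENN step: keep only samples consistent with their neighbourhood
    nn = _knn(X2, X2, k + 1)[:, 1:]
    keep = [i for i in range(len(y2))
            if np.bincount(y2[nn[i]].astype(int)).argmax() == y2[i]]
    return X2[keep], y2[keep]
```

The hybrid pairs the two methods' strengths: SMOTE fills out the minority region, while ENN removes the noisy or borderline samples that interpolation can introduce, which is why the citing paper reports using the combination over either method alone.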