A Novel Model for Imbalanced Data Classification

Yin, Jian; Gan, Chunjing; Zhao, Kaiqi; Lin, Xuan; Quan, Zhe; Wang, Zhijie

doi:10.1609/aaai.v34i04.6145

Cited by 24 publications

(8 citation statements)

References 20 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Solutions to tackle the imbalance problem of classification can be broadly classified into four major families [9]: sampling methods (including oversampling and undersampling) [34,35], costsensitive learning [29], distance metric learning [36], and ensemble learning [37] and hybrid methods which integrate the features from different families such as Adacost [38], RUSBoost [39] and DDAE [40].…”

Section: Figure 2 Class Distribution On the Dataset With Ir=12mentioning

confidence: 99%

“…This paper focuses on comparing the performance of these algorithms on multiple datasets from several different domains, including healthcare, card playing, software development projects and hand-written digit recognition. We use these datasets to evaluate ten imbalanced classification algorithms, namely 1) sampling: SMOTE [35] and MWMOTE [42]; 2) costsensitive learning: MetaCost [43], CAdaMEC [44] and cost-sensitive decision tree [9]; 3) distance metric learning: Iterative Metric Learning (IML) [36]; 4) ensemble learning and hybrid methods: AdaBoost [45], RUSBoost [39], self-paced Ensemble Classifier [11] and DDAE [40]. Our experiments not only analyze the performance of different models based on a general set of evaluation metrics on the same dataset, but also quantify the impact of key factors related to imbalanced learning, such as the size of the dataset and the imbalance ratio, as well as system performance in terms of learning time and memory usage.…”

Section: Figure 2 Class Distribution On the Dataset With Ir=12mentioning

confidence: 99%

“…DDAE. DDAE[40] is a novel model to address the class imbalance problem consisting of resampling, data metrics learning, cost-sensitive learning, and ensemble learning. Besides using kNN as the base classifier, DDAE has four components: 1) Data Block Construction (DBC), divides the training (both minority and majority) samples into different number of data blocks based on the given balanced ratio; 2) Data Space Improvement (DSI) applies LMNN (similar to IML) to improve the data space (through relabeling and regrouping the neighbouring data blocks) for training samples in each data block generated in the DBC component; 3) Adaptive Weight Adjustment (AWA) finds an appropriate overall class weight generated using the data coming from each data block[89]; 4) Ensemble Learning (EL) leverages ensemble learning with the weight determined via AWA; multiple base classifiers with major voting technique work on the final decision for each input sample.…”

mentioning

confidence: 99%

See 2 more Smart Citations

Experimental Comparison of Classification Methods under Class Imbalance

Chen¹,

Ji²

2021

ICST Transactions on Scalable Information Systems

View full text Add to dashboard Cite

The class imbalance problem is prevalent in many domains including medical, natural language processing, image recognition, economic and geographic areas etc. We perform a systematic experimental comparison of different imbalance classification algorithms -ranging from sampling, distance metric learning, costsensitive learning to ensemble learning approaches -on several datasets from UCI, KEEL and OpenML. The algorithms included DDAE, MWMOTE, SMOTE, RUSBoost, AdaBoost, cost-sensitive decision tree (csDCT), self-paced Ensemble Classifier, MetaCost, CAdaMEC and Iterative Metric Learning (IML). As the substantial bias potentially caused by imbalance classification can be harmful for underrepresented classes which are of critical social and economic values and even lives, the main objective of our study is thus to understand the impact of imbalance ratio and the size of the utilized datasets on the performance of the above-mentioned algorithms. Our experiments show that 1) Sampling methods perform the worst and cannot be used directly for imbalanced classification, since they lack of consideration of neighborhoods based on distance. However, some classifiers can be improved after the balance of class distribution. 2) Cost-sensitive learning models should be utilized when the dataset is less imbalanced, because it is difficult to set an appropriate cost matrix for a specific dataset, which can cause performance fluctuations. 3) IML consistently shows good performance (in terms of F1 and AUCPRC), is resilient to different imbalance ratios but sensitive to the data distribution of the dataset. 4) Ensemble learning techniques generally perform better over other approaches due to their combined intelligence of multiple basic classifiers. 5) In terms of system performance, self-paced Ensemble Classifier performs fairly well with regards to learning time, while IML and DDAE yield the longest learning time; AdaBoost and self-paced Ensemble Classifier are two algorithms require lowest memory usage. We also provide our empirical recommendation for algorithm selection under different requirements and usage scenarios based on our analysis.

show abstract

Section: Figure 2 Class Distribution On the Dataset With Ir=12mentioning

confidence: 99%

Section: Figure 2 Class Distribution On the Dataset With Ir=12mentioning

confidence: 99%

mentioning

confidence: 99%

See 1 more Smart Citation

Experimental Comparison of Classification Methods under Class Imbalance

Chen¹,

Ji²

2021

ICST Transactions on Scalable Information Systems

View full text Add to dashboard Cite

show abstract

“…Inspired by the remarkable achievements that deep learning has shown in a variety of domains, including computer vision [14] and natural language processing [15,16], it also has gained lots of attention for molecular property prediction. The molecular representation methods being introduced can be mainly summarized into two parts: sequencebased and graph-based approaches.…”

Section: Introductionmentioning

confidence: 99%

Mol‐BERT: An Effective Molecular Representation with BERT for Molecular Property Prediction

Jiang

2021

Wireless Communications and Mobile Computing

View full text Add to dashboard Cite

Molecular property prediction is an essential task in drug discovery. Most computational approaches with deep learning techniques either focus on designing novel molecular representation or combining with some advanced models together. However, researchers pay fewer attention to the potential benefits in massive unlabeled molecular data (e.g., ZINC). This task becomes increasingly challenging owing to the limitation of the scale of labeled data. Motivated by the recent advancements of pretrained models in natural language processing, the drug molecule can be naturally viewed as language to some extent. In this paper, we investigate how to develop the pretrained model BERT to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale prediction BERT model is pretrained to generate the embedding of molecular substructures, by using four million unlabeled drug SMILES (i.e., ZINC 15 and ChEMBL 27). Then, the pretrained BERT model can be fine-tuned on various molecular property prediction tasks. To examine the performance of our proposed Mol-BERT, we conduct several experiments on 4 widely used molecular datasets. In comparison to the traditional and state-of-the-art baselines, the results illustrate that our proposed Mol-BERT can outperform the current sequence-based methods and achieve at least 2% improvement on ROC-AUC score on Tox21, SIDER, and ClinTox dataset.

show abstract

“…It automatically learns to extract features without professional knowledge for feature extraction. In manufacturing, the defect detection is a typical imbalanced data problem and defect samples are usually less than nondefective ones [15]. The imbalance ratio (IR) is usually used to describe the ratio of minority to majority samples.…”

Section: Introductionmentioning

confidence: 99%

Classification of Imbalanced Data Using Deep Learning with Adding Noise

Fan

Lee

2021

Journal of Sensors

View full text Add to dashboard Cite

This paper proposes a method to treat the classification of imbalanced data by adding noise to the feature space of convolutional neural network (CNN) without changing a data set (ratio of majority and minority data). Besides, a hybrid loss function of crossentropy and KL divergence is proposed. The proposed approach can improve the accuracy of minority class in the testing data. In addition, a simple design method for selecting structure of CNN is first introduced and then, we add noise in feature space of CNN to obtain proper features by a training process and to improve the classification results. From comparison results, we can find that the proposed method can extract the suitable features to improve the accuracy of minority class. Finally, illustrated examples of multiclass classification problems and the corresponding discussion in balance ratio are presented. Our approach performs well with smaller network structure compared with other deep models. In addition, the performance is improved over 40% in defective accuracy by adding noise approach. Finally, the accuracy is higher than 96%; even the imbalanced ratio (IR) is one hundred.

show abstract

A Novel Model for Imbalanced Data Classification

Cited by 24 publications

References 20 publications

Experimental Comparison of Classification Methods under Class Imbalance

Experimental Comparison of Classification Methods under Class Imbalance

Mol‐BERT: An Effective Molecular Representation with BERT for Molecular Property Prediction

Classification of Imbalanced Data Using Deep Learning with Adding Noise

Contact Info

Product

Resources

About