2010
DOI: 10.1007/s00726-010-0595-2

An approach for classification of highly imbalanced data using weighting and undersampling

Abstract: Real-world datasets commonly have issues with data imbalance. There are several approaches such as weighting, sub-sampling, and data modeling for handling these data. Learning in the presence of data imbalances presents a great challenge to machine learning. Techniques such as support-vector machines have excellent performance for balanced data, but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. The performa…
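
The paper's own selection heuristic is only hinted at in the truncated abstract, so the following Python sketch shows plain random undersampling of the majority class as a stand-in; the function name, the `ratio` parameter, and the random selection strategy are illustrative assumptions, not the authors' method.

```python
import random

def undersample_majority(X, y, majority_label, ratio=1.0, seed=42):
    """Hypothetical helper: randomly keep only `ratio` majority
    examples per minority example. The paper proposes an informed
    selection of majority instances; this random version is a
    minimal baseline sketch, not the authors' technique."""
    rng = random.Random(seed)
    maj_idx = [i for i, label in enumerate(y) if label == majority_label]
    min_idx = [i for i, label in enumerate(y) if label != majority_label]
    n_keep = min(len(maj_idx), int(ratio * len(min_idx)))
    keep = rng.sample(maj_idx, n_keep)
    kept = sorted(keep + min_idx)
    return [X[i] for i in kept], [y[i] for i in kept]
```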

Cited by 144 publications (74 citation statements)
References 32 publications

“…Experiments done on different methods conclude with ambiguous results: while Anand et al [63], corroborated by Li et al [15], opt for sampling methods as the optimal solution, we observe on the other front McCarthy et al [65], in agreement with Liu et al [88], on the superiority of cost-sensitive learning; Quinlan [94] and Thomas [100] endorse ensemble learning methods; Cieslak [38] and Marcellin [90], on the other hand, defend the algorithm-modification approaches.…”
Section: Discussion
confidence: 99%
“…As a result, accuracy is not used to evaluate the performance of classifiers on imbalanced datasets, and more reasonable evaluation metrics should be adopted [33,34].…”
Section: Evaluation Measures
confidence: 99%
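
A toy Python example (invented numbers) makes the quoted point concrete: on a 99:1 dataset, a degenerate classifier that always predicts the majority class scores 99% accuracy while detecting no positives at all.

```python
# 990 majority (negative) and 10 minority (positive) examples;
# the classifier predicts the majority class for everything.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99 -- looks excellent, yet not one positive is detected
```
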
“…In medical science, bioinformatics, and machine learning communities [23,24,33,34], the sensitivity (SE) and the specificity (SP) are two metrics used to evaluate the performance of classifiers. Sensitivity measures the proportion of actual positives which are correctly identified as such, while specificity can be defined as the proportion of negatives which are correctly identified.…”
Section: Evaluation Measures
confidence: 99%
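
From the definitions quoted above, both metrics follow directly from confusion-matrix counts. A minimal sketch (the helper name and the `positive` label parameter are assumptions for illustration):

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity SE = TP / (TP + FN); specificity SP = TN / (TN + FP)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    return se, sp
```
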
“…This measure tries to maximize the accuracy of both classes while keeping the two accuracies balanced. Several researchers have used this metric for evaluating classifiers on imbalanced datasets (Kubat and Matwin 1997; Robert et al 1997; Wu and Chang 2003; Anand et al 2010). We also utilize this metric to evaluate the SVM classifier for the highly imbalanced γ-turn dataset, and modify the evaluation criterion of LibSVM (Chang and Lin 2001) to use the G-mean metric in this study.…”
Section: Imbalanced Problem
confidence: 99%
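
The G-mean referenced in this statement is the geometric mean of sensitivity and specificity, G = sqrt(SE * SP), so it is high only when both classes are classified well. A sketch building on the `sensitivity_specificity` helper above (same illustrative assumptions):

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """G-mean = sqrt(SE * SP). It drops to zero whenever either
    class is missed entirely, unlike plain accuracy."""
    se, sp = sensitivity_specificity(y_true, y_pred, positive)
    return math.sqrt(se * sp)

# On the all-negative toy predictor above, SE = 0 and hence G-mean = 0,
# exposing the failure that 99% accuracy conceals.
```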