2014
DOI: 10.1007/s10115-014-0794-3
Class imbalance revisited: a new experimental setup to assess the performance of treatment methods

Cited by 154 publications (81 citation statements). References 27 publications.
“…A more global review on learning from skewed data was proposed by Branco [5] and concentrates on the more general issue of imbalanced predictive modeling. Among more specialized discussions of this topic, a thorough survey on ensemble learning by Galar et al. [17], an in-depth insight into imbalanced data characteristics by López et al. [36], and a discussion of new perspectives for evaluating classifiers on skewed datasets [42] deserve mentioning.…”
Section: Introduction
confidence: 99%
“…Thus, since the class distribution harms the learning process as it diverges extremely from the balanced one [27], it is immediate to use a distance/similarity function, d(ζ, e), between the empirical and balanced distributions, ζ and e, to summarise the degree of skewness of a classification problem K. Here, d stands for any chosen distance between vectors or divergence between probability distributions which can be found in the literature.…”
Section: Imbalance-degree
confidence: 99%
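The excerpt above defines the imbalance degree as a distance d(ζ, e) between the empirical class distribution ζ and the balanced distribution e = (1/K, …, 1/K). A minimal sketch of that idea, assuming Euclidean distance as an illustrative choice of d (the cited work allows any vector distance or probability divergence):

```python
from collections import Counter
from math import sqrt

def imbalance_distance(labels, dist=None):
    """Summarise the skewness of a labelled dataset as d(zeta, e),
    where zeta is the empirical class distribution and e is the
    balanced distribution (1/K, ..., 1/K) over the K observed classes.
    `dist` defaults to Euclidean distance; this is only one possible
    choice of d, not the one fixed by the cited paper."""
    counts = Counter(labels)
    k = len(counts)
    n = len(labels)
    zeta = [counts[c] / n for c in sorted(counts)]  # empirical distribution
    e = [1.0 / k] * k                               # balanced distribution
    if dist is None:
        dist = lambda p, q: sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    return dist(zeta, e)

# A perfectly balanced problem has distance 0; skew increases it.
print(imbalance_distance(["a", "b", "a", "b"]))   # 0.0
print(imbalance_distance(["a"] * 9 + ["b"]))      # ~0.566
```

Because `dist` is a parameter, the same sketch accommodates other summaries of skewness (e.g. a Hellinger distance or KL divergence) without changing the surrounding logic.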
“…There, each row corresponds to a dataset and each column stands for a characteristic (name, features and number of classes) or a summary (empirical class distribution, number of occurrences, IR and IDs). Afterwards, each dataset is used to feed a representative learning algorithm from the traditional major learning paradigms [27]. Specifically, for each problem, a different classifier is learnt using 5 different popular supervised algorithms: C4.5 (Decision trees), RIPPER (Decision rules), Neural Networks (Connectionism), Naïve Bayes (Probabilistic), and SVM (Statistical learning).…”
Section: Study 2: Sensitivity and Validity of Imbalance-degree
confidence: 99%