Nowadays, the solution to many practical problems relies on machine learning tools. However, compiling the appropriate training data set for real-world classification problems is challenging because collecting the right amount of data for each class is often difficult or even impossible. In such cases, we can easily face the problem of imbalanced learning. There are many methods in the literature for solving the imbalanced learning problem, so it has become a serious question how to compare the performance of the imbalanced learning methods. Inadequate validation techniques can provide misleading results (e.g., due to data shift), which leads to the development of methods designed for imbalanced data sets, such as stratified cross-validation (SCV) and distribution optimally balanced SCV (DOB-SCV). Previous studies have shown that higher classification performance scores (AUC) can be achieved on imbalanced data sets using DOB-SCV instead of SCV. We investigated the effect of the oversamplers on this difference. The study was conducted on 420 data sets, involving several sampling methods and the DTree, kNN, SVM, and MLP classifiers. We point out that DOB-SCV often provides a little higher F1 and AUC values for classification combined with sampling. However, the results also prove that the selection of the sampler–classifier pair is more important for the classification performance than the choice between the DOB-SCV and the SCV techniques.
In daily routine the reticulin silver staining is used on bone marrow biopsy samples as a gold standard for the characterization of myelofibrosis, however this method does not provide information about the prefibrotic stage. Recently a specific immunohistochemical method was introduced which may overcome these weaknesses of reticulin staining. Activated fibroblasts responsible for stromal proliferation are highlighted by increased PDGFR β expression, which can be presented by immunohistochemistry in bone marrow samples. Using this staining the pre-fibrotic stage can become detectable and we have information about the disease activity. During development of new staining method it is important to prove its reliability and usability. In this paper we introduce a digital image processing method to measure paranchymal damage in digitalized histological slides that can aid correct interpretation of the staining.
Data imbalance is a serious problem in machine learning that can be alleviated at the data level by balancing the class distribution with sampling. In the last decade, several sampling methods have been published to address the shortcomings of the initial ones, such as noise sensitivity and incorrect neighbor selection. Based on the review of the literature, it has become clear to us that the algorithms achieve varying performance on different data sets. In this paper, we present a new oversampler that has been developed based on the key steps and sampling strategies identified by analyzing dozens of existing methods and that can be fitted to various data sets through an optimization process. Experiments were performed on a number of data sets, which show that the proposed method had a similar or better effect on the performance of SVM, DTree, kNN and MLP classifiers compared with other well-known samplers found in the literature. The results were also confirmed by statistical tests.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.