A survey on addressing high-class imbalance in big data

Leevy, Joffrey L.; Khoshgoftaar, Taghi M.; Bauder, Richard A.; Seliya, Naeem

doi:10.1186/s40537-018-0151-6

Cited by 491 publications

(266 citation statements)

References 65 publications

(157 reference statements)


“…The latter refers to some difficulties that appear when the number of samples in one or more classes in the dataset is fewer than another class (or classes), thereby producing an important deterioration of the classifier performance [16]. In the literature, many studies dealing with this problem have been reported [17]; in particular, the data sampling methods such as Random Over-Sampling (ROS), which replicate samples from the minority class, and Random Under-Sampling (RUS), which eliminate samples from the majority class. These methods bias the discrimination process to compensate the class imbalance ratio [18].…”

Section: Introduction

mentioning

confidence: 99%

Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem

et al. 2020


The class imbalance problem has been a hot topic in the machine learning community in recent years. Nowadays, in the time of big data and deep learning, this problem remains in force. Much work has been performed to deal to the class imbalance problem, the random sampling methods (over and under sampling) being the most widely employed approaches. Moreover, sophisticated sampling methods have been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and also they have been combined with cleaning techniques such as Editing Nearest Neighbor or Tomek's Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, it is noticeable that the class imbalance problem has been addressed by adaptation of traditional techniques, relatively ignoring intelligent approaches. Thus, the capabilities and possibilities of heuristic sampling methods on deep learning neural networks in big data domain are analyzed in this work, and the cleaning strategies are particularly analyzed. This study is developed on big data, multi-class imbalanced datasets obtained from hyper-spectral remote sensing images. The effectiveness of a hybrid approach on these datasets is analyzed, in which the dataset is cleaned by SMOTE followed by the training of an Artificial Neural Network (ANN) with those data, while the neural network output noise is processed with ENN to eliminate output noise; after that, the ANN is trained again with the resultant dataset. Obtained results suggest that best classification outcome is achieved when the cleaning strategies are applied on an ANN output instead of input feature space only. Consequently, the need to consider the classifier's nature when the classical class imbalance approaches are adapted in deep learning and big data scenarios is clear.
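The two random sampling methods named in the citation statement above (ROS replicates minority samples, RUS eliminates majority samples) can be illustrated with a minimal sketch. This is not the cited authors' implementation: the function names, the toy data, and the choice to balance every class to a 1:1 ratio are assumptions made for illustration only.

```python
import random
from collections import Counter

def group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return groups

def random_oversample(samples, labels, seed=0):
    """ROS: replicate randomly chosen samples until each class matches the largest one."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = max(len(xs) for xs in groups.values())
    resampled = []
    for y, xs in groups.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        resampled.extend((x, y) for x in xs + extra)
    return resampled

def random_undersample(samples, labels, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest one."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = min(len(xs) for xs in groups.values())
    resampled = []
    for y, xs in groups.items():
        kept = rng.sample(xs, target)
        resampled.extend((x, y) for x in kept)
    return resampled

# Hypothetical toy data: 95 majority samples and 5 minority samples.
samples = list(range(100))
labels = ["majority"] * 95 + ["minority"] * 5

ros_counts = Counter(y for _, y in random_oversample(samples, labels))
rus_counts = Counter(y for _, y in random_undersample(samples, labels))
print("ROS:", dict(ros_counts))
print("RUS:", dict(rus_counts))
```

ROS leaves the majority class untouched and grows the dataset, while RUS shrinks it, which is why undersampling is usually cheaper to train on.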


“…Since the data points are selected based on their relevance to the classification task, the resultant reduced training set is much more balanced in size across the target classes. In other words, the formulation addresses the problem statement of class imbalance, which is a topic of current research in big data [24].…”

Section: Introduction

mentioning

confidence: 99%

A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training

Yadav, Bode 2019

J Big Data


A scalable graphical method is presented for selecting, and partitioning datasets for the training phase of a classification task. For the heuristic, a clustering algorithm is required to get its computation cost in a reasonable proportion to the task itself. This step is proceeded by construction of an information graph of the underlying classification patterns using approximate nearest neighbor methods. The presented method constitutes of two approaches, one for reducing a given training set, and another for partitioning the selected/reduced set. The heuristic targets large datasets, since the primary goal is significant reduction in training computation run-time without compromising prediction accuracy. Test results show that both approaches significantly speed-up the training task when compared against that of state-of-the-art shrinking heuristic available in LIBSVM. Furthermore, the approaches closely follow or even outperform in prediction accuracy. A network design is also presented for the partitioning based distributed training formulation. Added speed-up in training run-time is observed when compared to that of serial implementation of the approaches.


“…Dataset rarity is associated with insignificant numbers of positive instances [4], e.g., the occurrence of 25 fraudulent transactions among 1,000,000 normal transactions within a financial security dataset of a reputable bank. Since many multi-class problems can be simplified by binary classification, data scientists frequently take the binary approach for analytics [5]. The minority (positive) class, which accounts for a smaller percentage of the dataset, is often the class of interest in real-world problems [5].…”

mentioning

confidence: 99%

“…Since many multi-class problems can be simplified by binary classification, data scientists frequently take the binary approach for analytics [5]. The minority (positive) class, which accounts for a smaller percentage of the dataset, is often the class of interest in real-world problems [5]. The majority (negative) class constitutes the larger percentage.…”

mentioning

confidence: 99%

“…Machine learning algorithms generally outperform traditional statistical techniques at classification [6][7][8], but these algorithms cannot effectively distinguish between majority and minority classes if the dataset suffers from severe class imbalance or rarity. Severely imbalanced data, also known as high-class imbalance, is often defined by majority-to-minority class ratios between 100:1 and 10,000:1 [5]. The failure to sufficiently distinguish between majority and minority classes is akin to searching for a proverbial polar bear in a snowstorm and could cause the classifier to label almost all instances as the majority (negative) class, thereby producing an accuracy performance metric value that is deceptively high.…”

mentioning

confidence: 99%
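The "deceptively high" accuracy described in the statement above can be checked with simple arithmetic. The sketch below reuses the class counts from the earlier fraud example (25 positive instances among 1,000,000 negative ones) and assumes a hypothetical classifier that labels every instance as negative; the helper names are illustrative and do not come from the cited paper.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all instances that were labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Fraction of positive (minority) instances that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

n_pos, n_neg = 25, 1_000_000  # counts taken from the fraud example quoted earlier

# Hypothetical classifier that predicts "negative" for every instance.
tp, fn = 0, n_pos   # every positive instance is missed
tn, fp = n_neg, 0   # every negative instance is labeled correctly

acc = accuracy(tp, tn, fp, fn)
rec = recall(tp, fn)
print(f"accuracy = {acc:.6f}, minority recall = {rec:.1f}")
```

Accuracy comes out above 99.99% even though not a single fraudulent transaction is detected, which is why accuracy alone is a poor metric under severe class imbalance.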


Severely imbalanced Big Data challenges: investigating data sampling approaches

et al. 2019

Self Cite


Introduction: The exponential increase of raw data in recent years has been associated with technological advances in the fields of Data Mining (DM) and Machine Learning (ML) [1,2]. These advances have significantly improved the efficiency and effectiveness of Big Data applications in a diverse range of areas, such as knowledge discovery and information processing. Big Data is identified by various data-related properties, and for this reason, an exact definition of Big Data remains elusive. One definition, presented by Senthilkumar et al. [3], relates Big Data to six V's: Volume, Variety, Velocity, Veracity, Variability, and Value. Volume is associated with the reams of data produced by an organization.

Abstract: Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and the occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach using Area Under the Receiver Operating Characteristic Curve and Geometric Mean performance metrics. However, it should be noted that the Random Undersampling approach performs adequately in the first case study. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, ADAptive SYNthetic) when measuring performance with Area Under the Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice as it results in models with a significantly smaller number of samples, thus reducing computational burden and training time.
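The Geometric Mean metric named in the abstract above is commonly defined, for binary problems, as the square root of the product of the true positive rate and the true negative rate. The sketch below assumes that common definition (the abstract itself does not spell out a formula), and the confusion-matrix counts are invented for illustration.

```python
import math

def geometric_mean(tp, fn, tn, fp):
    """G-mean = sqrt(TPR * TNR); it is 0 whenever one class is entirely missed."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate (sensitivity)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate (specificity)
    return math.sqrt(tpr * tnr)

# Hypothetical classifier A: predicts the majority (negative) class for everything.
g_all_negative = geometric_mean(tp=0, fn=25, tn=1_000_000, fp=0)

# Hypothetical classifier B: finds 20 of 25 positives at the cost of 100,000 false alarms.
g_balanced = geometric_mean(tp=20, fn=5, tn=900_000, fp=100_000)

print(f"always-negative G-mean = {g_all_negative:.3f}")
print(f"classifier B G-mean    = {g_balanced:.3f}")
```

Unlike accuracy, this metric gives the always-negative classifier a score of zero, so it rewards models that perform reasonably on both classes.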
