An Empirical Study on Class Rarity in Big Data

Bauder, Richard A.; Khoshgoftaar, Taghi M.; Hasanin, Tawfiq

doi:10.1109/icmla.2018.00125

Cited by 32 publications

(18 citation statements)

References 29 publications


“…Finally, in [5], the impact of class rarity on big data is evaluated. The researchers use publicly available Medicare data and map known fraudulent providers, from the List of Excluded Individuals/Entities (LEIE) [23], as labels for the positive class.…”

Section: Related Work

mentioning

confidence: 99%

“…Various degrees of class imbalance exist, ranging from slightly imbalanced to rarity. Class rarity in a dataset is defined by comparatively inconsequential numbers of positive instances [5], e.g., the occurrence of 10 fraudulent transactions out of 1,000,000 total transactions generated daily for a bank. Binary classification is usually associated with class imbalance since many multi-class classification problems can be managed by breaking down the data into multiple binary classification tasks.…”

mentioning

confidence: 99%
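To put the rarity example in the statement above into numbers, here is a minimal sketch of the arithmetic. The function names `positive_rate` and `imbalance_ratio` are illustrative and do not come from the cited papers.

```python
def positive_rate(positives: int, total: int) -> float:
    """Fraction of all instances that belong to the positive (minority) class."""
    return positives / total


def imbalance_ratio(positives: int, total: int) -> float:
    """Number of negative instances per positive instance."""
    negatives = total - positives
    return negatives / positives


# Example from the citation statement: 10 fraudulent transactions
# out of 1,000,000 total transactions.
rate = positive_rate(10, 1_000_000)      # 0.00001, i.e. 0.001% positives
ratio = imbalance_ratio(10, 1_000_000)   # 99,999 negatives per positive
print(f"positive rate: {rate:.5%}, negatives per positive: {ratio:,.0f}")
```

At that rate, a model that labels every transaction as normal would still be 99.999% accurate, which is why these studies report metrics other than accuracy.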

Investigating class rarity in big data

et al. 2020

Introduction: When called upon to define big data, researchers and practitioners in the field of data science frequently refer to the six V's: volume, variety, velocity, variability, value, and veracity [1]. Volume, the best-known property of big data, is associated with the profusion of data produced by an organization. Variety covers the handling of structured, unstructured, and semi-structured data. Velocity takes into account how quickly data is generated, issued, and processed. Variability refers to the fluctuations in data. Value is often regarded as a critical attribute because it is required for effective decision-making. Veracity is associated with the fidelity of data.

Abstract: In Machine Learning, if one class has a significantly larger number of instances (majority) than the other (minority), this condition is defined as class imbalance. Class imbalance in a dataset can bias the predictive capabilities of Machine Learning algorithms towards the majority (negative) class, and in situations where false negatives incur a greater penalty than false positives, this imbalance may lead to adverse consequences. Our paper incorporates two case studies, each utilizing a unique approach of three learners (gradient-boosted trees, logistic regression, random forest) and three performance metrics (Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, Geometric Mean) to investigate class rarity in big data. Class rarity, a notably extreme degree of class imbalance, was effected in our experiments by randomly removing minority (positive) instances to artificially generate eight subsets of gradually decreasing positive class instances. All model evaluations were performed through Cross-Validation. In the first case study, which uses a Medicare Part B dataset, performance scores for the learners generally improve with the Area Under the Receiver Operating Characteristic Curve metric as the rarity level decreases, while corresponding scores with the Area Under the Precision-Recall Curve and Geometric Mean metrics show no improvement. In the second case study, which uses a dataset built from Distributed Denial of Service attack data (POSTSlowloris Combined), the Area Under the Receiver Operating Characteristic Curve metric produces very high performance scores for the learners with all subsets of positive class instances. For the second study, scores for the learners generally improve with the Area Under the Precision-Recall Curve and Geometric Mean metrics as the rarity level decreases. Overall, across both case studies, the Gradient-Boosted Trees (GBT) learner performs the best.
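The rarity procedure and the Geometric Mean metric described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names, the positive counts in the usage example, and the choice to draw each subset independently from the full positive pool are all hypothetical, since the abstract does not give these details.

```python
import math
import random


def make_rarity_subsets(labels, positive_counts, seed=0):
    """Build one index list per target count: every negative instance is kept,
    and only `k` randomly chosen positive instances survive."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    subsets = []
    for k in positive_counts:
        kept_pos = rng.sample(pos, k)  # random removal of the other positives
        subsets.append(sorted(neg + kept_pos))
    return subsets


def geometric_mean(y_true, y_pred):
    """Square root of (true positive rate * true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


# Hypothetical toy data: 100 positives and 900 negatives.
labels = [1] * 100 + [0] * 900
# Eight gradually decreasing positive counts (illustrative values only).
counts = [80, 60, 40, 30, 20, 10, 5, 2]
subsets = make_rarity_subsets(labels, counts)
```

Because the Geometric Mean multiplies the two class-wise rates, a learner that ignores the positive class scores zero regardless of how many negatives it classifies correctly.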

“…For both RF and GBT, which share several similar parameter settings, the number of trees generated in the training process was set to 100 [6,51]. The Cache Node Ids was set to True and the maximum memory in megabytes (MB) was set to 1024 for speeding up the tree-building process.…”

Section: Classifiers

mentioning

confidence: 99%
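The settings quoted above can be written down as plain parameter dictionaries. This sketch assumes the citing paper used Apache Spark MLlib, which the statement itself does not say; the keys follow the parameter names of Spark's tree estimators, where the number of trees in a gradient-boosted model is set through `maxIter` (one tree per boosting iteration).

```python
# Settings shared by both tree-based learners, as quoted in the statement.
SHARED_TREE_PARAMS = {
    "cacheNodeIds": True,    # "Cache Node Ids was set to True"
    "maxMemoryInMB": 1024,   # maximum memory for tree building, in MB
}

# Random Forest: 100 trees.
RF_PARAMS = {"numTrees": 100, **SHARED_TREE_PARAMS}

# Gradient-Boosted Trees: 100 boosting iterations, i.e. 100 trees
# (assumption: Spark naming, where maxIter controls the tree count).
GBT_PARAMS = {"maxIter": 100, **SHARED_TREE_PARAMS}
```

With an active Spark session, these dictionaries could be passed as keyword arguments to `pyspark.ml.classification.RandomForestClassifier` and `GBTClassifier` respectively.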

“…There are various degrees of class imbalance, ranging from slightly imbalanced to rarity. Rarity in a dataset involves comparatively inconsequential numbers of positive instances [6], e.g., the occurrence of 40 fraudulent transactions within an insurance claims dataset of 1,000,000 normal transactions. Binary classification is frequently utilized to focus on class imbalance because many non-binary (i.e., multi-class) classification problems can be addressed by transforming the given data into multiple binary classification tasks.…”

mentioning

confidence: 99%

Examining characteristics of predictive models with imbalanced big data

et al. 2019

“…The spectrum of class imbalance ranges from "slightly imbalanced" to "rarity." Dataset rarity is associated with insignificant numbers of positive instances [4], e.g., the occurrence of 25 fraudulent transactions among 1,000,000 normal transactions within a financial security dataset of a reputable bank. Since many multi-class problems can be simplified by binary classification, data scientists frequently take the binary approach for analytics [5].…”

mentioning

confidence: 99%

Severely imbalanced Big Data challenges: investigating data sampling approaches

et al. 2019

Introduction: The exponential increase of raw data in recent years has been associated with technological advances in the fields of Data Mining (DM) and Machine Learning (ML) [1,2]. These advances have significantly improved the efficiency and effectiveness of Big Data applications in a diverse range of areas, such as knowledge discovery and information processing. Big Data is identified by various data-related properties, and for this reason, an exact definition of Big Data remains elusive. One definition, presented by Senthilkumar et al. [3], relates Big Data to six V's: Volume, Variety, Velocity, Veracity, Variability, and Value. Volume is associated with the reams of data produced by an organization.

Abstract: Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and the occurrence of false negatives incurs a greater penalty than false positives, the bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach using Area Under the Receiver Operating Characteristic Curve and Geometric Mean performance metrics. However, it should be noted that the Random Undersampling approach performs adequately in the first case study. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, ADAptive SYNthetic) when measuring performance with Area Under the Receiver Operating Characteristic Curve and Geometric Mean metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice as it results in models with a significantly smaller number of samples, thus reducing computational burden and training time.
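Random Undersampling, the approach the abstract singles out, can be illustrated with a short sketch. The function name `random_undersample` and the `neg_to_pos_ratio` parameter are hypothetical; the abstract mentions five sampled distribution ratios without listing them, so the ratio in the usage example is only a placeholder.

```python
import random


def random_undersample(labels, neg_to_pos_ratio=1.0, seed=0):
    """Keep every positive instance and a random sample of negatives so that
    the result has about `neg_to_pos_ratio` negatives per positive."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never request more negatives than exist.
    n_keep = min(len(neg), round(len(pos) * neg_to_pos_ratio))
    kept_neg = rng.sample(neg, n_keep)
    return sorted(pos + kept_neg)


# Hypothetical toy data: 10 positives among 1,000 instances.
labels = [1] * 10 + [0] * 990
# Illustrative target of 9 negatives per positive (a 90:10 distribution).
kept = random_undersample(labels, neg_to_pos_ratio=9.0)
print(len(kept))  # 100 instances remain: 10 positives and 90 negatives
```

The training set shrinks from 1,000 to 100 rows in this toy case, which reflects the reduced computational burden the abstract attributes to undersampling.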
