Investigating class rarity in big data

Hasanin, Tawfiq; Khoshgoftaar, Taghi M.; Leevy, Joffrey L.; Bauder, Richard A.

doi:10.1186/s40537-020-00301-0

Cited by 15 publications

(10 citation statements)

References 27 publications


“…The application of GBDT algorithms for classification and regression tasks to many types of Big Data is well studied [11–13]. To the best of our knowledge, this is the first survey specifically dedicated to the CatBoost implementation of GBDT’s.…”

Section: Introduction (mentioning)

confidence: 99%

“…For example, Spark MLlib’s GradientBoostedTrees module, [15], is one such implementation. For examples of GBDT applications in Spark please see [16] and [11]. However, as long as the distributed framework supports a language that the Gradient Boosted Decision Tree implementation has an application programming interface available for, it is possible to use that implementation in the framework; thus, freeing the user to select from the most appealing GBDT implementation available.…”

Section: Introduction (mentioning)

confidence: 99%


CatBoost for big data: an interdisciplinary review

2020


Gradient Boosted Decision Trees (GBDT’s) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT’s in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.


“…Future work with the CSE-CIC-IDS2018 dataset can investigate other families of attacks, individual web attack labels (as compared to the combined web attack labels used in this study), and the effects of rarity [50]. Other datasets could also be included for future work, as well as additional performance metrics, classifiers, and sampling techniques.…”

Section: Discussion (mentioning)

confidence: 99%

Detecting web attacks using random undersampling and ensemble learners

2021


Class imbalance is an important consideration for cybersecurity and machine learning. We explore classification performance in detecting web attacks in the recent CSE-CIC-IDS2018 dataset. This study considers a total of eight random undersampling (RUS) ratios: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. Additionally, seven different classifiers are employed: Decision Tree (DT), Random Forest (RF), CatBoost (CB), LightGBM (LGB), XGBoost (XGB), Naive Bayes (NB), and Logistic Regression (LR). For classification performance metrics, Area Under the Receiver Operating Characteristic Curve (AUC) and Area Under the Precision-Recall Curve (AUPRC) are both utilized to answer the following three research questions. The first question asks: “Are various random undersampling ratios statistically different from each other in detecting web attacks?” The second question asks: “Are different classifiers statistically different from each other in detecting web attacks?” And our third question asks: “Is the interaction between different classifiers and random undersampling ratios significant for detecting web attacks?” Based on our experiments, the answer to all three research questions is “Yes”. To the best of our knowledge, we are the first to apply random undersampling techniques to web attacks from the CSE-CIC-IDS2018 dataset while exploring various sampling ratios.
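As an illustration of the RUS ratios listed in this abstract, the following is a minimal sketch of random undersampling to a target majority:minority ratio. It is not the authors' code: the function name `random_undersample`, the 0/1 label encoding (1 = attack), and the toy data of 10,000 negatives and 10 positives are all assumptions made for the example.

```python
import random
from collections import Counter


def random_undersample(labels, majority_parts, minority_parts, seed=0):
    """Return sorted indices kept after randomly discarding majority (label 0)
    instances so that majority:minority is roughly majority_parts:minority_parts.
    Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == 1]
    majority_idx = [i for i, y in enumerate(labels) if y == 0]
    # Number of majority instances implied by the target ratio.
    target = round(len(minority_idx) * majority_parts / minority_parts)
    # Undersampling only removes instances, so cap at what is available.
    target = min(target, len(majority_idx))
    kept_majority = rng.sample(majority_idx, target)
    return sorted(minority_idx + kept_majority)


# Toy data (assumed): 10,000 normal instances and 10 attack instances.
labels = [0] * 10_000 + [1] * 10

# The seven sampled ratios from the abstract ("no sampling" keeps everything).
ratios = [(999, 1), (99, 1), (95, 5), (9, 1), (3, 1), (65, 35), (1, 1)]

counts = {}
for majority_parts, minority_parts in ratios:
    kept = random_undersample(labels, majority_parts, minority_parts)
    counts[(majority_parts, minority_parts)] = Counter(labels[i] for i in kept)
```

Every minority instance is kept at each ratio; only the majority class shrinks, which is why the more aggressive ratios leave very little training data when the positive class is rare.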


“…Future work can explore Naive Bayes and its noteworthy classification performance when no sampling is applied under conditions of severe class imbalance and rarity (as well as its insensitivity to improvements when applying RUS). Other datasets could also be included for future work, as well as additional performance metrics, families of attacks, classifiers, sampling techniques, and rarity levels [56].…”

Section: Discussion (mentioning)

confidence: 99%

Investigating rarity in web attacks with ensemble learners

2021


Class rarity is a frequent challenge in cybersecurity. Rarity occurs when the positive (attack) class only has a small number of instances for machine learning classifiers to train upon, thus making it difficult for the classifiers to discriminate and learn from the positive class. To investigate rarity, we examine three individual web attacks in big data from the CSE-CIC-IDS2018 dataset: “Brute Force-Web”, “Brute Force-XSS”, and “SQL Injection”. These three individual web attacks are also severely imbalanced, and so we evaluate whether random undersampling (RUS) treatments can improve the classification performance for these three individual web attacks. The following eight different levels of RUS ratios are evaluated: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. For measuring classification performance, Area Under the Receiver Operating Characteristic Curve (AUC) metrics are obtained for the following seven different classifiers: Random Forest (RF), CatBoost (CB), LightGBM (LGB), XGBoost (XGB), Decision Tree (DT), Naive Bayes (NB), and Logistic Regression (LR) (with the first four learners being ensemble learners and for comparison, the last three being single learners). We find that applying random undersampling does improve overall classification performance with the AUC metric in a statistically significant manner. Ensemble learners achieve the top AUC scores after massive undersampling is applied, but the ensemble learners break down and have poor performance (worse than NB and DT) when no sampling is applied to our unique and harsh experimental conditions of severe class imbalance and rarity.
