2018
DOI: 10.1016/j.patcog.2018.03.008
Handling data irregularities in classification: Foundations, trends, and future challenges

Cited by 168 publications (70 citation statements) · References 108 publications
“…However, it should be mentioned that these theoretical advantages were built upon regularity assumptions on the data, such as independent and identically distributed samples. When faced with data irregularities, such as class imbalance, small disjuncts, and class distribution skew [10], the theoretical advantages may no longer hold and the algorithm itself should be modified. We refer the reader to the review [10] on modifying classifiers for irregular data.…”
Section: Conclusion and Discussion
confidence: 99%
“…When faced with data irregularities, such as class imbalance, small disjuncts, and class distribution skew [10], the theoretical advantages may no longer hold and the algorithm itself should be modified. We refer the reader to the review [10] on modifying classifiers for irregular data. We will keep working on the study of using RBoosting to generate classifiers for irregular data and report our progress in a future publication.…”
Section: Conclusion and Discussion
confidence: 99%
“…Most of the learning and classification methods used in building such ID models are based on a number of key assumptions [2,3], such as: (i) the equal representation of classes, (ii) the equal representation of sub-concepts within a specific class, (iii) similar class-conditional distributions across all classes, and (iv) the pre-definition and knowledge of all attribute values for all records in the dataset. Due to traffic evolution, most, if not all, of these assumptions are violated in real environments, as new traffic will start to exhibit statistical properties different from those of the training data.…”
Section: Problem Statement
confidence: 99%
“…The class that is under-represented, with fewer instances than the others because of rare events, abnormal patterns, unusual behaviours, or interruptions during data gathering, is known as the minority, while the remaining class or classes with an abundant number of instances are named the majority [3]. Figure 1 maps the types of imbalanced data [4], frequently suggested solutions in the literature [5], assessment metrics to evaluate the effectiveness of these solutions [6], and widespread real-world applications of imbalanced data [3].…”
Section: Introduction
confidence: 99%
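The minority/majority distinction described in the excerpt above reduces to comparing class frequencies in the label vector. A minimal sketch (the function name and the "normal"/"attack" example labels are illustrative, not from the cited paper):

```python
from collections import Counter

def split_minority_majority(labels):
    """Partition class labels into the minority class (fewest
    instances) and the remaining majority class(es), returning
    the per-class counts as well."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    majority = [c for c in counts if c != minority]
    return minority, majority, counts

# Example: a skewed binary label set, 95 "normal" vs. 5 "attack"
labels = ["normal"] * 95 + ["attack"] * 5
minority, majority, counts = split_minority_majority(labels)
# minority == "attack"; majority == ["normal"]
```

With ties in class counts, `min` picks one class arbitrarily; real datasets with several rare classes would need a threshold-based definition instead.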
“…For instance, detecting an attack is more important than detecting normal traffic, and diagnosing a disease is more critical than confirming health. The class imbalance problem is typically handled in three ways: under-/oversampling, modifying the algorithm, and reducing misclassification cost [5]. However, these approaches have several limitations, such as working well only on small data, incurring more computing and storage cost because of algorithm complexity, being slow by the algorithm's nature, handling either binary-class or multi-class problems but not both, and requiring predefined threshold values.…”
Section: Introduction
confidence: 99%