The impact of data difficulty factors on classification of imbalanced and concept drifting data streams

Brzeziński, Dariusz; Minku, Leandro L.; Pewinski, Tomasz; Stefanowski, Jerzy; Szumaczuk, Artur

doi:10.1007/s10115-021-01560-w

Cited by 35 publications

(33 citation statements)

References 78 publications

“…To evaluate the classifiers in specific and controlled scenarios, we prepared data streams generators under different imbalanced and drifting settings. Nine generators in MOA (Bifet et al, 2010b) plus one generator proposed by Brzeziński (Brzeziński et al, 2021) were used. Those generators are presented in Table 3, with their number of attributes, classes, and whether they can generate internal concept drifts.…”

Section: Generators (mentioning)

confidence: 99%

“…Goal of the experiment. This experiment addresses RQ3 and evaluates the robustness of the classifiers to instance-level difficulties (Brzeziński et al, 2021). We evaluated the Brzeziński generator with borderline or rare instances, and combining both at the same time.…”

Section: Instance-level Difficulties (mentioning)

confidence: 99%

“…Check for instance-level difficulties. Instance-level difficulties in data streams pose significant difficulties to most of the classifiers (Brzeziński et al, 2021). It is crucial to analyze your stream to understand if such factors are present.…”

Section: Recommendations (mentioning)

confidence: 99%

“…An experimental analysis of oversampling in drifting streams was presented by Bernardo and Della Valle (2022), focusing on C-SMOTE and how it can benefit selected classifiers. In Brzeziński et al (2021) authors examined the interactions between the global class imbalance ratio, local data difficulty factors, and concept drift, using five classifiers. While these works are an important first step, there is a need for a unified and holistic study of learning from imbalanced data streams that could be used as a template for researchers to evaluate their newly proposed algorithms.…”

mentioning

confidence: 99%

A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

Aguiar, Krawczyk, Cano

2022

Preprint

Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures on how to evaluate these algorithms. This work presents a taxonomy of algorithms for imbalanced data streams and proposes a standardized, exhaustive, and informative experimental testbed to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to the largest experimental study conducted so far in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and we provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental testbed is fully reproducible and easy to extend with new methods. In this way, we propose the first standardized approach to conducting experiments on imbalanced data streams, which other researchers can use to create trustworthy and fair evaluations of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.

“…There are already many papers dealing with the challenge of class imbalance for the classification task [10,19,20]. Many difficulty factors for this problem have been investigated [10,20] and various solutions proposed [21]. However, none of these solutions is based on hypergraph modeling of categorical features and focused on optimizing the random undersampling.…”

Section: Related Work (mentioning)

confidence: 99%

Hypergraph-based data modeling for binary classification

Misiorek, Janowski

2022

Preprint

We present a novel hypergraph-based framework for assessing the importance of data elements in the binary classification task. The proposed method explores classification datasets modeled as a hypergraph, with vertices denoting data samples and hyperedges corresponding to the values of categorical features used to describe these samples. The structure of the hypergraph is applied to rate data samples and values of categorical features from the perspective of their relevance to binary classification labels. The proposed Hypergraph-based Importance (HI) ratings are theoretically grounded in the hypergraph cut conductance minimization concept. As a result of using a hypergraph representation, which is regarded as a lossless representation from the perspective of higher-order relationships in data, our approach enables more precise exploitation of the information on feature and sample coincidences. The solution has been tested in two experimental scenarios: random undersampling for imbalanced classification data, based on an adaptive algorithm using the HI ratings of data samples, and feature selection, based on the HI ratings of categorical feature values. The experimental results demonstrate the good quality of the new approach when compared with other state-of-the-art and baseline methods in both scenarios.
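The hypergraph representation described in this abstract (samples as vertices, categorical feature values as hyperedges) can be illustrated with a minimal Python sketch. This is only an assumption-laden illustration of the data structure, not the authors' implementation, and it does not compute the HI ratings; the function name `build_hypergraph` and the toy rows are hypothetical.

```python
from collections import defaultdict


def build_hypergraph(rows):
    """Map each (feature, value) pair to the set of sample indices carrying it.

    Each sample index is a vertex; each (feature, value) pair is a hyperedge
    connecting all samples that share that categorical value.
    """
    hyperedges = defaultdict(set)
    for idx, row in enumerate(rows):
        for feature, value in row.items():
            hyperedges[(feature, value)].add(idx)
    return dict(hyperedges)


# Hypothetical toy dataset with two categorical features.
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "square"},
    {"color": "blue", "shape": "round"},
]

hypergraph = build_hypergraph(rows)
for edge, vertices in sorted(hypergraph.items()):
    print(edge, sorted(vertices))
```

A hyperedge here may join any number of samples, which is what distinguishes the structure from an ordinary graph whose edges join exactly two vertices.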

Classification of Multi-class Imbalanced Data: Data Difficulty Factors and Selected Methods for Improving Classifiers

Stefanowski

2021

Rough Sets
