2018
DOI: 10.1186/s40537-018-0151-6

A survey on addressing high-class imbalance in big data

Abstract: Any dataset with unequal distribution between its majority and minority classes can be considered to have class imbalance, and in real-world applications, the severity of class imbalance can vary from minor to severe (high or extreme). A dataset can be considered imbalanced if the classes, e.g., fraud and non-fraud cases, are not equally represented. The majority class makes up most of the dataset, whereas the minority class, with limited dataset representation, is often considered the class of interest. With …

Cited by 491 publications (266 citation statements)
References 65 publications (157 reference statements)
“…The latter refers to some difficulties that appear when the number of samples in one or more classes in the dataset is fewer than another class (or classes), thereby producing an important deterioration of the classifier performance [16]. In the literature, many studies dealing with this problem have been reported [17]; in particular, the data sampling methods such as Random Over-Sampling (ROS), which replicate samples from the minority class, and Random Under-Sampling (RUS), which eliminate samples from the majority class. These methods bias the discrimination process to compensate the class imbalance ratio [18].…”
Section: Introduction
mentioning
confidence: 99%
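
The two sampling methods named in the statement above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the cited authors' implementation; the function names `random_over_sample` and `random_under_sample`, the fixed seed, and the toy 5-versus-95 dataset are all assumptions made for the example.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_over_sample(samples, labels, seed=0):
    """ROS: replicate randomly chosen samples of smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out.extend((x, y) for x in xs + extra)
    rng.shuffle(out)
    return [x for x, _ in out], [y for _, y in out]


def random_under_sample(samples, labels, seed=0):
    """RUS: discard randomly chosen samples of larger classes
    until every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        out.extend((x, y) for x in rng.sample(xs, target))
    rng.shuffle(out)
    return [x for x, _ in out], [y for _, y in out]


# Hypothetical toy data: 5 minority (label 1) and 95 majority (label 0) samples.
X = list(range(100))
y = [1] * 5 + [0] * 95

X_ros, y_ros = random_over_sample(X, y)
X_rus, y_rus = random_under_sample(X, y)
print(Counter(y_ros), Counter(y_rus))
```

On this toy data, ROS leaves 95 samples in each class (190 in total, with minority samples repeated), while RUS leaves 5 in each class (10 in total, with most majority samples discarded). That difference is the usual trade-off between the two: ROS keeps all information but duplicates points, and RUS shrinks the training set at the cost of dropping majority data.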
“…Since the data points are selected based on their relevance to the classification task, the resultant reduced training set is much more balanced in size across the target classes. In other words, the formulation addresses the problem statement of class imbalance, which is a topic of current research in big data [24].…”
Section: Introduction
mentioning
confidence: 99%
“…Dataset rarity is associated with insignificant numbers of positive instances [4], e.g., the occurrence of 25 fraudulent transactions among 1,000,000 normal transactions within a financial security dataset of a reputable bank. Since many multi-class problems can be simplified by binary classification, data scientists frequently take the binary approach for analytics [5]. The minority (positive) class, which accounts for a smaller percentage of the dataset, is often the class of interest in real-world problems [5].…”
mentioning
confidence: 99%
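
To make the scale of the fraud example in the statement above concrete, the arithmetic can be written out as a short sketch. The helper name `class_distribution` and the returned keys are hypothetical; only the counts (25 fraudulent and 1,000,000 normal transactions) come from the quoted text.

```python
def class_distribution(n_minority, n_majority):
    """Return the minority share of the dataset and the
    majority-to-minority imbalance ratio."""
    total = n_minority + n_majority
    return {
        "minority_share": n_minority / total,
        "imbalance_ratio": n_majority / n_minority,
    }


# Counts taken from the quoted example: 25 fraud cases among 1,000,000 normal ones.
stats = class_distribution(25, 1_000_000)
print(f"minority share: {stats['minority_share']:.6%}")
print(f"imbalance ratio: {stats['imbalance_ratio']:.0f}:1")
```

With these counts the majority class outnumbers the minority class 40,000 to 1, and the positive class makes up roughly 0.0025% of the dataset.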
“…Since many multi-class problems can be simplified by binary classification, data scientists frequently take the binary approach for analytics [5]. The minority (positive) class, which accounts for a smaller percentage of the dataset, is often the class of interest in real-world problems [5]. The majority (negative) class constitutes the larger percentage.…”
mentioning
confidence: 99%