2013
DOI: 10.7763/ijmlc.2013.v3.307

Addressing the Class Imbalance Problem in Medical Datasets

Abstract: A well-balanced dataset is very important for creating a good prediction model. Medical datasets are often not balanced in their class labels. Most existing classification methods tend to perform poorly on minority class examples when the dataset is extremely imbalanced. This is because they aim to optimize the overall accuracy without considering the relative distribution of each class. In this paper we examine the performance of over-sampling and under-sampling techniques to balance cardiovascular d…

Cited by 296 publications (158 citation statements)
References 11 publications (18 reference statements)
“…To compensate for dropouts from the panel, who were no longer included in the study population, over a timespan of years, the oversample method was used. This allows the initial sampling to be respected, provided that the initial population is known and that the statistical processing, and weights attributed, are different between the groups that make up each panel dropout situation (cases of death, change of address without being able to identify the new address, long term travel, prolonged hospitalization, and entry into a long-term care institution) 11 . Age, sex, and education level were variables selected to delimit the entry of new subjects.…”
Section: Methods (mentioning)
confidence: 99%
“…The author in this paper [8] said that Down-sizing the majority class results in a loss of information that may result in overly general rules. In order to overcome this drawback of the under-sampling approach Yen and Lee (2009) proposed an unsupervised learning technique for supervised learning called cluster based under-sampling.…”
Section: Related Work (mentioning)
confidence: 99%
“…Drummond and Holte showed that random under-sampling yields better minority prediction than random over-sampling [47]. More recently, Rahman and Davis showed that the class imbalance problem in medical datasets could be addressed with a new clustering-based under-sampling approach where cluster centers can be used to choose the sample's representatives for the majority class data [48]. Furthermore, while over-sampling would have allowed to both increase the size of the dataset and have a more representative set of the non-spiculated cases, we were concerned about the applicability of over-sampling in real settings.…”
Section: Spiculation Classification (mentioning)
confidence: 99%
“…Furthermore, it will be interesting to explore how the results generalize for larger datasets given that the LIDC data contains only 77 spiculated nodules and a random under-sampling procedure was used to generate balanced datasets of spiculated and non-spiculated nodules. Finally, we will explore addressing the class imbalance problem in the LIDC dataset using the new clustering-based under-sampling approach technique proposed by Rahman and Davis [48].…”
mentioning
confidence: 99%
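
Several of the statements above refer to random under-sampling as a way to build a balanced dataset. As a rough illustration only, the sketch below shows one common form of that idea: every class is randomly reduced to the size of the smallest class. The function name `random_undersample`, the `seed` parameter, and the toy data are assumptions made for this example; they are not taken from the cited paper or from any of the citing publications, and this is not the clustering-based approach of Rahman and Davis.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop examples so every class keeps as many as the smallest class.

    Hypothetical helper for illustration; returns a shuffled list of
    (sample, label) pairs.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    # The smallest class sets the target size for all classes.
    n_min = min(len(xs) for xs in by_class.values())

    balanced = []
    for y, xs in by_class.items():
        # Sample without replacement; the minority class is kept whole.
        for x in rng.sample(xs, n_min):
            balanced.append((x, y))

    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    # Toy data (made up): 2 minority examples (label 1), 8 majority (label 0).
    toy_samples = list(range(10))
    toy_labels = [1, 1] + [0] * 8
    for sample, label in random_undersample(toy_samples, toy_labels):
        print(sample, label)
```

The trade-off is the one noted in the Related Work statement above: discarding majority examples loses information, which is the drawback that motivates the cluster-based variants.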