Addressing the Class Imbalance Problem in Medical Datasets

Rahman, M. Mostafizur; Davis, Darryl N.

doi:10.7763/ijmlc.2013.v3.307

Cited by 296 publications (158 citation statements)

References 11 publications (18 reference statements)

Citation types: Supporting, Mentioning (147), Contrasting, Unclassified

“…To compensate for dropouts from the panel, who were no longer included in the study population, over a timespan of years, the oversample method was used. This allows the initial sampling to be respected, provided that the initial population is known and that the statistical processing, and weights attributed, are different between the groups that make up each panel dropout situation (cases of death, change of address without being able to identify the new address, long term travel, prolonged hospitalization, and entry into a long-term care institution) 11 . Age, sex, and education level were variables selected to delimit the entry of new subjects.…”

Section: Methods (mentioning)

confidence: 99%

Prevalence of fear of falling, in a sample of elderly adults in the community

Cruz

Duque

Leite

2017

Rev. bras. geriatr. gerontol.

Objectives: To investigate the prevalence of fear of falling among a sample of elderly persons in the community, and to analyze its correlation with age, self-perceived health, difficulty walking, use of an assistive device for walking, history of falls, and functional capacity. Method: A cross-sectional study of 314 non-institutionalized elderly individuals, living in the city of Juiz de Fora (in the state of Minas Gerais) in 2015, was carried out. A household survey was conducted and fear of falling was assessed using the Falls Efficacy Scale - International - Brazil (FES-I-BRASIL). The Spearman correlation was used to verify the correlation of the independent variables with the fear of falling. The significance level for the study was 5%. Results: The prevalence of fear of falling among the elderly was 95.2% (95% CI = 92.3; 97.3). Fear of falling was significantly correlated with all the variables analyzed: age (r = 0.199), self-perceived health (r = 0.299), difficulty walking (r = -0.480), use of an assistive device for walking (r = 0.337), history of falls (r = -0.177), and functional capacity (r = -0.476). Conclusions: A high prevalence of fear of falling was observed, with a significant correlation between the outcome and the variables studied. These findings point to the need for rehabilitation, prevention, and health promotion strategies that enable healthy aging.

“…The author in this paper [8] said that Down-sizing the majority class results in a loss of information that may result in overly general rules. In order to overcome this drawback of the under-sampling approach Yen and Lee (2009) proposed an unsupervised learning technique for supervised learning called cluster based under-sampling.…”

Section: Related Work (mentioning)

confidence: 99%

A Review on Imbalanced Data Handling Using Undersampling and Oversampling Technique

2017

IJRTER

Abstract: In today's era of the internet, the amount of data generated keeps increasing. Some of the data related to medical, e-commerce, social networking, etc. are of great importance. But many of these datasets are imbalanced; that is, records belonging to some categories are very numerous while others are very rare. For extracting useful data from such large datasets, different data mining or machine learning techniques are used. But the imbalanced nature of the datasets greatly affects the performance of a classifier. To deal with this, it is necessary to understand the problem of imbalanced learning. There are various undersampling and oversampling techniques available which try to resolve the imbalanced learning problem. This paper studies the imbalanced nature of the datasets and the different oversampling and undersampling techniques that are used to balance them.

“…Drummond and Holte showed that random under-sampling yields better minority prediction than random over-sampling [47]. More recently, Rahman and Davis showed that the class imbalance problem in medical datasets could be addressed with a new clustering-based under-sampling approach where cluster centers can be used to choose the sample's representatives for the majority class data [48]. Furthermore, while over-sampling would have allowed to both increase the size of the dataset and have a more representative set of the non-spiculated cases, we were concerned about the applicability of over-sampling in real settings.…”

Section: Spiculation Classification (mentioning)

confidence: 99%

“…Furthermore, it will be interesting to explore how the results generalize for larger datasets given that the LIDC data contains only 77 spiculated nodules and a random under-sampling procedure was used to generate balanced datasets of spiculated and non-spiculated nodules. Finally, we will explore addressing the class imbalance problem in the LIDC dataset using the new clustering-based under-sampling approach technique proposed by Rahman and Davis [48].…”

mentioning

confidence: 99%

Toward Understanding the Size Dependence of Shape Features for Predicting Spiculation in Lung Nodules for Computer-Aided Diagnosis

et al. 2015

We analyze the importance of shape features for predicting spiculation ratings assigned by radiologists to lung nodules in computed tomography (CT) scans. Using the Lung Image Database Consortium (LIDC) data and classification models based on decision trees, we demonstrate that the importance of several shape features increases disproportionately relative to other image features with increasing size of the nodule. Our shaped-based classification results show an area under the receiver operating characteristic (ROC) curve of 0.65 when classifying spiculation for small nodules and an area of 0.91 for large nodules, resulting in a 26 % difference in classification performance using shape features. An analysis of the results illustrates that this change in performance is driven by features that measure boundary complexity, which perform well for large nodules but perform relatively poorly and do no better than other features for small nodules. For large nodules, the roughness of the segmented boundary maps well to the semantic concept of spiculation. For small nodules, measuring directly the complexity of hard segmentations does not yield good results for predicting spiculation due to limits imposed by spatial resolution and the uncertainty in boundary location. Therefore, a wider range of features, including shape, texture, and intensity features, are needed to predict spiculation ratings for small nodules. A further implication is that the efficacy of shape features for a particular classifier used to create computer-aided diagnosis systems depends on the distribution of nodule sizes in the training and testing sets, which may not be consistent across different research studies.
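The citation statements above describe clustering-based under-sampling only in outline: cluster the majority class, then use the cluster centers to choose representative majority samples. The sketch below is one possible reading of that idea, not the procedure published by Rahman and Davis. It assumes plain k-means, Euclidean distance, and one representative per cluster; the function names `kmeans` and `cluster_undersample` and the parameter `n_keep` are hypothetical.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples; returns k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # requires k <= len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = []
        for c, g in zip(centers, groups):
            if g:
                new_centers.append(tuple(sum(col) / len(g) for col in zip(*g)))
            else:
                new_centers.append(c)
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def cluster_undersample(majority, n_keep, seed=0):
    """Pick n_keep real majority samples, one nearest to each cluster center.

    Assumption: one representative per cluster; other selection rules are possible.
    """
    centers = kmeans(majority, n_keep, seed=seed)
    remaining = list(majority)
    chosen = []
    for c in centers:
        rep = min(remaining, key=lambda p: dist(p, c))
        chosen.append(rep)
        remaining.remove(rep)  # avoid picking the same sample twice
    return chosen


if __name__ == "__main__":
    # Synthetic data purely for illustration: 200 majority vs. 20 minority points.
    rng = random.Random(42)
    majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(20)]
    reduced = cluster_undersample(majority, n_keep=len(minority))
    print(len(reduced), len(minority))
```

Setting `n_keep` to the minority class size gives a balanced training set. Unlike random under-sampling, the retained majority samples are spread across the regions the majority class occupies, which is the motivation the citing papers give for the approach.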
