2013
DOI: 10.1186/1471-2105-14-106

SMOTE for high-dimensional class-imbalanced data

Abstract: Background: Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling Technique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling…
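As a concrete illustration of the setting the abstract describes, the sketch below builds a small dataset with far more variables than samples and a roughly 9:1 class imbalance, then rebalances it with SMOTE. It uses Python with scikit-learn and imbalanced-learn; the library choice and all parameter values are illustrative assumptions, not taken from the paper.

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# High-dimensional, class-imbalanced data: 1000 variables, 100 samples,
# roughly 90 majority vs 10 minority samples (the p >> n regime).
X, y = make_classification(
    n_samples=100,
    n_features=1000,
    n_informative=20,
    weights=[0.9, 0.1],
    random_state=0,
)
print("before:", Counter(y))

# SMOTE creates synthetic minority samples by interpolating between a
# minority sample and one of its k nearest minority-class neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes now balanced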

Cited by 688 publications (460 citation statements)
References 34 publications
“…The results were ranked based on total cost, and a cost-benefit analysis was performed to see if costs could be reduced. In general, as indicated in the literature, under-sampling seems to work better than over-sampling and SMOTE [22]. The authors recommend the usage of random under-sampling as a solution for class imbalanced datasets because it is also computationally less expensive to implement than SMOTE or over-sampling.…”
Section: Results
Mentioning confidence: 93%
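The statement above recommends random under-sampling as the computationally cheaper alternative. A minimal sketch of that option, again assuming the imbalanced-learn API (the citing paper names no specific implementation):

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=100, n_features=1000,
                           weights=[0.9, 0.1], random_state=0)

# Randomly discards majority-class samples until the classes match.
# No neighbor search is required, so this avoids the k-NN cost of SMOTE.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # roughly 10 samples per class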
“…SMOTE is also computationally expensive to implement when compared to sampling methods like random under-sampling [21]. However, other experiments have proved that simple under-sampling tends to outperform SMOTE in most situations [22]. The performance of classifiers implementing SMOTE has been found to vary based on the number of dimensions in the training dataset [22].…”
Section: Learning From Class Imbalanced Datasets
Mentioning confidence: 99%
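The dimensionality effect mentioned above can be probed with a rough experiment: the same SMOTE-plus-classifier pipeline evaluated at a low and a high feature count. The dataset, classifier, and metric below are illustrative assumptions, not the paper's protocol.

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for n_features in (10, 1000):
    X, y = make_classification(n_samples=200, n_features=n_features,
                               n_informative=5, weights=[0.9, 0.1],
                               random_state=0)
    # imbalanced-learn's Pipeline applies SMOTE only to the training folds,
    # so the cross-validated score is not inflated by synthetic test samples.
    pipe = Pipeline([("smote", SMOTE(random_state=0)),
                     ("knn", KNeighborsClassifier(n_neighbors=3))])
    score = cross_val_score(pipe, X, y, cv=5,
                            scoring="balanced_accuracy").mean()
    print(f"{n_features:4d} features: balanced accuracy = {score:.2f}")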
“…This observation needs to be explored in our future work; we also plan to consider sampling methods [5,6,11,15,33,42,102].…”
Section: Figure 68 A Concept Showing Point Correspondence (A) Query
Mentioning confidence: 99%
“…Under-sampling removes some instances of the majority class and thus may lead to a loss of information, whereas over-sampling generates artificial samples for the minority class. The various techniques for handling an imbalance are addressed in [5,6,11,15,33,42,102].…”
Section: Future Work
Mentioning confidence: 99%