2021
DOI: 10.48550/arxiv.2109.00201
Preprint

An Empirical Study on the Joint Impact of Feature Selection and Data Re-sampling on Imbalance Classification

Abstract: In predictive tasks, real-world datasets often exhibit imbalanced (i.e., long-tailed or skewed) class distributions of varying degrees. While the majority (head, or most frequent) classes have sufficient samples, the minority (tail, less frequent, or rare) classes can be under-represented by a rather limited number of samples. Data pre-processing has been shown to be very effective in dealing with such problems. On one hand, data re-sampling is a common approach to tackling class imbalance. On the …
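The title pairs feature selection with data re-sampling; a minimal sketch of that joint pipeline, assuming scikit-learn's SelectKBest for the selection step and imbalanced-learn's SMOTE for the re-sampling step (neither method is named in the truncated abstract; both are illustrative choices):

```python
# Minimal sketch of combining feature selection with data re-sampling
# on an imbalanced dataset. SelectKBest and SMOTE are illustrative
# choices; the paper studies the joint impact of such combinations.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data: roughly 10% minority class.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Step 1: feature selection (keep the 10 most informative features).
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Step 2: re-sampling (oversample the minority class with SMOTE).
X_res, y_res = SMOTE(random_state=0).fit_resample(X_sel, y)

# X_res/y_res now feed a downstream classifier; the paper's question
# is how choosing and ordering these two steps affects performance.
```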

Cited by 1 publication (2 citation statements) | References 47 publications

“…The level of peakedness or non-Gaussian behavior in the frequency domain. This value was calculated for the frequency bands [0,1.5] Hz and [1,4] Hz…”
Section: Spectral Kurtosis
Mentioning confidence: 99%
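The citing work computes band-limited spectral kurtosis as a feature; a minimal sketch of one common reading of that description, assuming Welch's power spectral density and the sample kurtosis of the in-band spectrum (the citing paper's exact estimator is not given here):

```python
# Band-limited spectral kurtosis as described in the citation
# statement: kurtosis (peakedness / non-Gaussianity) of the power
# spectrum within a frequency band. Welch's PSD is an assumption;
# the citing paper does not specify its estimator.
import numpy as np
from scipy.signal import welch
from scipy.stats import kurtosis

def band_spectral_kurtosis(x, fs, f_lo, f_hi):
    """Kurtosis of the Welch PSD restricted to [f_lo, f_hi] Hz."""
    freqs, psd = welch(x, fs=fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return kurtosis(psd[band])

# Example: a 1 Hz sinusoid plus noise, sampled at 50 Hz.
fs = 50.0
t = np.arange(0, 60, 1 / fs)
x = np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.random.randn(t.size)

# The two bands mentioned in the citation statement.
for lo, hi in [(0.0, 1.5), (1.0, 4.0)]:
    print(f"[{lo}, {hi}] Hz: {band_spectral_kurtosis(x, fs, lo, hi):.2f}")
```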
“…The problem of class imbalance arises when some classes (or categories) have a significantly smaller number of samples than others, leading to a model that is less likely to detect those minority classes because the training set contains too few of their samples for proper learning. This problem presents itself in various domains and applications, including but not limited to security, finance, environment, agriculture, and health (1)(2)(3)(4). Typically, class imbalance is mitigated either at the model level, by adapting and adjusting the training procedure based on the different data samples and training progression, or at the data level, by modifying the class distributions in such a way as to allow for improved class separability, typically via resampling (5)(6)(7).…”
Section: Introduction
Mentioning confidence: 99%
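The statement distinguishes model-level from data-level mitigation; a minimal sketch contrasting the two, assuming scikit-learn's class_weight option for the former and imbalanced-learn's RandomOverSampler for the latter (both are illustrative choices, not methods named by the citing paper):

```python
# Contrast of the two mitigation levels described above.
# class_weight='balanced' and RandomOverSampler are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

# Model level: reweight the loss so minority-class errors cost more.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
clf_weighted.fit(X, y)

# Data level: modify the class distribution before training.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
clf_resampled = LogisticRegression(max_iter=1000)
clf_resampled.fit(X_res, y_res)
```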