2019
DOI: 10.1007/s13748-019-00172-4

Instance selection improves geometric mean accuracy: a study on imbalanced data classification

Abstract: A natural way of handling imbalanced data is to attempt to equalise the class frequencies and train the classifier of choice on balanced data. For two-class imbalanced problems, the classification success is typically measured by the geometric mean (GM) of the true positive and true negative rates. Here we prove that GM can be improved upon by instance selection, and give the theoretical conditions for such an improvement. We demonstrate that GM is non-monotonic with respect to the number of retained instances…
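As an illustration of the idea summarised in the abstract (a sketch only, not the paper's proposed selection method), the snippet below compares GM on a held-out set before and after a simple instance-selection step, random undersampling of the majority class. The synthetic data, the logistic-regression classifier and the undersampling rule are all assumptions made for the example.

```python
# Sketch: effect of a simple instance-selection step on the geometric mean (GM).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def geometric_mean(y_true, y_pred):
    """GM = sqrt(true-positive rate * true-negative rate)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

# Imbalanced two-class data (roughly 9:1).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: train on all instances.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("GM, all instances:", geometric_mean(y_te, clf.predict(X_te)))

# Instance selection: randomly keep as many majority instances as minority ones.
rng = np.random.default_rng(0)
maj, mino = np.flatnonzero(y_tr == 0), np.flatnonzero(y_tr == 1)
keep = np.concatenate([rng.choice(maj, size=mino.size, replace=False), mino])
clf_sel = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])
print("GM, selected instances:", geometric_mean(y_te, clf_sel.predict(X_te)))
```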

Cited by 58 publications (22 citation statements)
References: 44 publications
“…The experiments in the first block were performed on artificial data sets taken from the paper by Napierala et al (2010) because using synthetic data allows us to know their characteristics a priori and analyze the effects of resampling in a fully controlled environment. The second group of experiments was on a well-known benchmark suite of real-life databases widely used for class imbalance problems (Chen et al, 2019;Jing et al, 2019;Kovács, 2019;Kuncheva et al, 2019;Lopez-Garcia et al, 2019), which are all available at the KEEL database repository (Alcalá-Fdez et al, 2011). The results of both experiments were estimated by 5-fold stratified cross-validation in order to have a sufficient amount of positive examples in the test partitions.…”
Section: Methods
Citation type: mentioning (confidence: 99%)
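For context on the evaluation protocol mentioned in the quoted statement, here is a minimal sketch of 5-fold stratified cross-validation, which preserves the class ratio (and hence some positive examples) in every test fold. The dataset, classifier and GM scoring below are placeholder assumptions, not the cited experimental setup.

```python
# Sketch: 5-fold stratified cross-validation with GM scored on each test fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=1)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=1).split(X, y):
    clf = DecisionTreeClassifier(random_state=1).fit(X[train_idx], y[train_idx])
    tn, fp, fn, tp = confusion_matrix(y[test_idx], clf.predict(X[test_idx])).ravel()
    scores.append(np.sqrt((tp / (tp + fn)) * (tn / (tn + fp))))  # geometric mean

print("GM per fold:", np.round(scores, 3), "mean:", round(float(np.mean(scores)), 3))
```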
“…For a binary classification problem, the classification performance is typically measured by the geometric mean (G-Mean) of the true-positive and the true-negative rates [35]. G-Mean is a measure for imbalanced classification that can be optimized to achieve a balance between sensitivity and specificity.…”
Section: Covid-19 Detection Analysis
Citation type: mentioning (confidence: 99%)
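The quoted statement notes that G-Mean can be optimized to balance sensitivity and specificity. One common way to do this, shown as a hedged sketch below (not taken from the cited paper), is to sweep the decision threshold of a probabilistic classifier and keep the threshold that maximizes G-Mean; the data and classifier are assumptions for illustration.

```python
# Sketch: choose the decision threshold that maximizes G-Mean on held-out data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba)
gmean = np.sqrt(tpr * (1.0 - fpr))  # GM = sqrt(sensitivity * specificity)
best = np.argmax(gmean)
print(f"best threshold {thresholds[best]:.3f} gives G-Mean {gmean[best]:.3f}")
```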
“…In the last step of the above procedure, the ACC, SEN, SPE, and GM [33] are defined by ACC = (TP + TN) / (TP + TN + FP + FN), SEN = TP / (TP + FN), SPE = TN / (TN + FP) and GM = sqrt(SEN × SPE), where FP and FN denote false positives and false negatives, and ‘TP’ and ‘TN’ are short for ‘True Positive’ and ‘True Negative’, respectively.…”
Section: Numerical Results
Citation type: mentioning (confidence: 99%)
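Since the quoted equations did not survive extraction, the small helper below encodes the standard definitions of ACC, SEN, SPE and GM from binary confusion-matrix counts; it is a reconstruction under that assumption, not code from the cited paper.

```python
from math import sqrt

def imbalance_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard metrics from binary confusion-matrix counts (assumed definitions)."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # accuracy
    sen = tp / (tp + fn)                    # sensitivity, true-positive rate
    spe = tn / (tn + fp)                    # specificity, true-negative rate
    return {"ACC": acc, "SEN": sen, "SPE": spe, "GM": sqrt(sen * spe)}

# Example: 40 true positives, 900 true negatives, 50 false positives, 10 false negatives.
print(imbalance_metrics(tp=40, tn=900, fp=50, fn=10))
```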