Selecting the Suitable Resampling Strategy for Imbalanced Data Classification Regarding Dataset Properties. An Approach Based on Association Models

Kraiem, Mohamed S.; Sánchez, F.; García, María N. Moreno

doi:10.3390/app11188546

Cited by 24 publications (13 citation statements)

References 68 publications (88 reference statements)

“…The selection of the best re-sampling technique is complicated. Since the effectiveness of re-sampling techniques depends on intrinsic properties of the dataset, such as dataset size and dimensionality, imbalance ratio, overlapping between classes or borderline samples (41). In the present study, the majority class and minority class have close properties, such as both of them are clinical T1 stage lung adenocarcinoma.…”

Section: Discussion (mentioning)

confidence: 72%
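
The imbalance ratio named in the statement above can be made concrete with a small sketch. The function name `imbalance_ratio` is an illustrative assumption, and the label list is synthetic: it only mirrors the 148/35 non-metastasis/LNM split reported in the citing study's abstract, not the study's actual records.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())


# Synthetic labels that reproduce only the reported class sizes (148 vs. 35).
labels = ["non-metastasis"] * 148 + ["LNM"] * 35
ir = imbalance_ratio(labels)
print(f"imbalance ratio: {ir:.2f}")
```

A ratio near 1 indicates a balanced dataset; larger values indicate a more skewed class distribution, which is one of the dataset properties the statement lists as affecting which re-sampling technique works best.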

Imbalanced Data Correction Based PET/CT Radiomics Model for Predicting Lymph Node Metastasis in Clinical Stage T1 Lung Adenocarcinoma

Chen, Liu, et al. 2022

Front. Oncol.

Objectives: To develop and validate the imbalanced data correction based PET/CT radiomics model for predicting lymph node metastasis (LNM) in clinical stage T1 lung adenocarcinoma (LUAD).

Methods: A total of 183 patients (148/35 non-metastasis/LNM) with pathologically confirmed LUAD were retrospectively included. The cohorts were divided into training vs. validation cohort in a ratio of 7:3. A total of 487 radiomics features were extracted from PET and CT components separately for radiomics model construction. Four clinical features and seven PET/CT radiological features were extracted for traditional model construction. To balance the distribution of majority (non-metastasis) class and minority (LNM) class, the imbalance-adjustment strategies using ten data re-sampling methods were adopted. Three multivariate models (denoted as Traditional, Radiomics, and Combined) were constructed using multivariable logistic regression analysis, where the combined model incorporated all of the significant clinical, radiological, and radiomics features. One hundred times repeated Monte Carlo cross-validation was used to assess the application order of feature selection and imbalance-adjustment strategies in the machine learning pipeline. Prediction performance of each model was evaluated using the area under the receiver operating characteristic curve (AUC) and Geometric mean score (G-mean).

Results: A total of 2 clinical parameters, 2 radiological features, 3 PET, and 5 CT radiomics features were significantly associated with LNM. The combined model with Edited Nearest Neighbors (ENN) re-sampling methods showed strong prediction performance than traditional model or radiomics model with the AUC of 0.94 (95%CI = 0.86–0.97) vs. 0.89 (95%CI = 0.79–0.93), 0.92 (95%CI = 0.85–0.97), and G-mean of 0.88 vs. 0.82, 0.80 in the training cohort, and the AUC of 0.75 (95%CI = 0.57–0.91) vs. 0.68 (95%CI = 0.36–0.83), 0.71 (95%CI = 0.48–0.83) and G-mean of 0.76 vs. 0.64, 0.51 in the validation cohort. The combination of performing feature selection before data re-sampling obtains a better result than the reverse combination (AUC 0.76 ± 0.06 vs. 0.70 ± 0.07, p<0.001).

Conclusions: The combined model (consisting of age, histological type, C/T ratio, MATV, and radiomics signature) integrated with ENN re-sampling methods had strong lymph node metastasis prediction performance for imbalance cohorts in clinical stage T1 LUAD. Radiomics signatures extracted from PET/CT images could provide complementary prediction information compared with traditional model.
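
The abstract reports G-mean alongside AUC but does not spell out the formula. The sketch below assumes the common binary definition, the square root of sensitivity times specificity; the function name `g_mean` and all confusion-matrix counts are hypothetical and are not taken from the study.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity for a binary classifier."""
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    sensitivity = tp / positives   # recall on the minority (positive) class
    specificity = tn / negatives   # recall on the majority (negative) class
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts: a classifier that treats both classes equally well.
balanced = g_mean(tp=8, fn=2, tn=36, fp=9)

# Hypothetical counts: a classifier that always predicts the majority class.
# Its accuracy is 45 / 55 (about 0.82), yet its G-mean collapses to zero.
majority_only = g_mean(tp=0, fn=10, tn=45, fp=0)

print(balanced, majority_only)
```

The second case shows why G-mean is a useful companion metric on imbalanced cohorts: it penalizes a model that ignores the minority class even when plain accuracy looks acceptable.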

“…Imbalanced dataset is a common problem in machine learning classification. This imbalanced data can prevent the machine learning algorithms from building accurate models for these minority classes and lead to prediction errors [25,34]. For example, SimpleLogistic worked better than decision tree with sampling methods for the datasets of obstetrics and gynecology and urology, but not for neurosurgery dataset.…”

Section: Discussion (mentioning)

confidence: 99%

“…For example, SimpleLogistic worked better than decision tree with sampling methods for the datasets of obstetrics and gynecology and urology, but not for neurosurgery dataset. There are several methods to solve this problem of imbalanced data, such as resampling the datasets by under-sampling the majority class and over-sampling the minority class, modifying algorithms, and considering a different perspective, such as anomaly [24,25,34]. We used two resampling approaches (Bagging and AdaBoost) to overcome the problem of imbalanced dataset.…”

Section: Discussion (mentioning)

confidence: 99%
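
The under-sampling and over-sampling mentioned in the statement above can be sketched in a few lines of standard-library Python. This is a minimal illustration of plain random resampling, not the procedure used in the citing study; the function name `random_resample`, its parameters, and the toy data are all assumptions.

```python
import random
from collections import Counter, defaultdict


def random_resample(rows, labels, mode="under", seed=0):
    """Balance classes by randomly dropping rows ("under") or duplicating rows ("over")."""
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    sizes = [len(members) for members in by_class.values()]
    # Under-sampling shrinks every class to the smallest class size;
    # over-sampling grows every class to the largest class size.
    target = min(sizes) if mode == "under" else max(sizes)
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) > target:
            chosen = rng.sample(members, target)  # drop surplus rows
        else:
            extra = [rng.choice(members) for _ in range(target - len(members))]
            chosen = members + extra              # duplicate rows with replacement
        out_rows.extend(chosen)
        out_labels.extend([label] * len(chosen))
    return out_rows, out_labels


# Toy data: 9 majority rows and 3 minority rows.
rows = list(range(12))
labels = ["maj"] * 9 + ["min"] * 3
_, under_labels = random_resample(rows, labels, mode="under")
_, over_labels = random_resample(rows, labels, mode="over")
print(Counter(under_labels), Counter(over_labels))
```

Random under-sampling discards information from the majority class, while random over-sampling repeats minority rows verbatim; methods such as SMOTE or ENN, named elsewhere on this page, replace these random choices with neighbor-based ones.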

Applying Machine Learning Techniques to the Audit of Antimicrobial Prophylaxis

et al. 2022

High rates of inappropriate use of surgical antimicrobial prophylaxis were reported in many countries. Auditing the prophylactic antimicrobial use in enormous medical records by manual review is labor-intensive and time-consuming. The purpose of this study is to develop accurate and efficient machine learning models for auditing appropriate surgical antimicrobial prophylaxis. The supervised machine learning classifiers (Auto-WEKA, multilayer perceptron, decision tree, SimpleLogistic, Bagging, and AdaBoost) were applied to an antimicrobial prophylaxis dataset, which contained 601 instances with 26 attributes. Multilayer perceptron, SimpleLogistic selected by Auto-WEKA, and decision tree algorithms had outstanding discrimination with weighted average AUC > 0.97. The Bagging and SMOTE algorithms could improve the predictive performance of decision tree against imbalanced datasets. Although with better performance measures, multilayer perceptron and Auto-WEKA took more execution time as compared with that of other algorithms. Multilayer perceptron, SimpleLogistic, and decision tree algorithms have outstanding performance measures for identifying the appropriateness of surgical prophylaxis. The efficient models developed by machine learning can be used to assist the antimicrobial stewardship team in the audit of surgical antimicrobial prophylaxis. In future research, we still have the challenges and opportunities of enriching our datasets with more useful clinical information to improve the performance of the algorithms.

“…One of them is the health area since it is a rich data source, including electronic medical records, administrative reports, and medical imaging among others (11,12). There are numerous studies in the literature, including from our group (13-16), in which different machine learning algorithms have been used for various purposes such as the automation of medical diagnosis, and prediction of mortality or treatment outcomes.…”

Section: Introduction (mentioning)

confidence: 99%

Predictors of the post-stroke status in the discharge from the hospital. Importance in nursing

Vico, Sánchez, Mesonero, et al. 2023

Enf Global

Patients and their families often ask nurses to predict the factors that influence post-stroke outcome. Many studies have been carried out to determine the factors that influence the neurological status of the post-stroke patient at the moment of discharge from the hospital. However, machine learning techniques have not been used for this purpose. Therefore, with the objective of obtaining association rules of neurological prognosis, a double analysis, both clinical and with machine learning techniques, of the possible associations of factors that influence the neurological status of post-stroke patients has been carried out. The Apriori algorithm detected several association rules with high confidence (≥ 95%), including the following pattern: in patients in the age range of 50-80 years, the association of a NIHSS between 11 and 15 points (intermediate/low NIHSS), along with thrombectomy, leads to recovery ad integrum at discharge. With the SMOTE resampling technique, 100% confidence was reached for the association of high NIHSS (>20) and involvement of the carotid and basilar arteries with a dire prognosis (exitus). These rules confirm, for the first time with machine learning, the importance of the association of some predictors in the post-stroke prognosis. Knowledge of these association rules by nurses can improve stroke outcome. In addition, the role of nurses in education programs that teach knowledge of risk factors and stroke prognosis becomes essential.
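
The "confidence" reported for the association rules above has a simple definition: among records that contain the rule's antecedent, the fraction that also contain its consequent. The sketch below computes support and confidence for a single rule. The function name `rule_metrics`, the item names, and the five patient records are invented for illustration; they are not data from the study, and this is not an implementation of Apriori itself, which additionally searches for frequent itemsets.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / n if n else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Invented records; each set lists the items observed for one hypothetical patient.
records = [
    {"NIHSS_11_15", "thrombectomy", "recovered"},
    {"NIHSS_11_15", "thrombectomy", "recovered"},
    {"NIHSS_11_15", "recovered"},
    {"NIHSS_gt_20", "exitus"},
    {"NIHSS_11_15", "thrombectomy", "not_recovered"},
]

support, confidence = rule_metrics(
    records, antecedent={"NIHSS_11_15", "thrombectomy"}, consequent={"recovered"}
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data the antecedent appears in three records and the full rule in two, so the rule has support 2/5 and confidence 2/3. A rule reported at 100% confidence is one for which every record matching the antecedent also matches the consequent.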
