“…Data-centric machine learning comprises a series of tasks, including standardization and normalization, data cleaning, feature extraction, dimensionality reduction, feature transformation, instance selection, undersampling, data synthesis, and oversampling [27]. However, even when the importance of data-centric methods is recognized, the challenge is to find an appropriate balance between these and model-centric methods to provide a robust machine learning solution [28]. This paper aims to present a data-centric approach applied to The Cancer Genome Atlas (TCGA) data set and to explore the potential benefits of oversampling and undersampling algorithms for addressing class imbalance, comparing their performance across six machine learning models (k-nearest neighbors, support vector machine, multi-layer perceptron, logistic regression, random forest, and CatBoost).…”
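
The excerpt does not say which oversampling and undersampling algorithms were used, so the following is only a minimal sketch of the simplest variants (random oversampling and random undersampling), written with the Python standard library. The function names, the toy feature rows, and the 0/1 labels are hypothetical illustrations and are not taken from the paper or from TCGA.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Sample with replacement to fill the gap to the largest class.
        extra = [rng.choice(rows) for _ in range(target - n)]
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so that every class has
    as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Sample without replacement down to the smallest class size.
        kept = rng.sample(rows, target)
        X_out.extend(kept)
        y_out.extend([label] * target)
    return X_out, y_out


# Hypothetical toy data: 8 rows of class 0 (majority), 2 rows of class 1 (minority).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)

print("original:    ", dict(Counter(y)))
print("oversampled: ", dict(Counter(y_over)))
print("undersampled:", dict(Counter(y_under)))
```

In practice, resampling of this kind is normally applied to the training split only, so that duplicated or discarded rows do not leak into the evaluation data.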