Background: Predicting risk cohorts for lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) is a crucial step in precision oncology. Currently, clinicians and patients learn the patient's risk group through cancer staging in the clinic. Several machine learning approaches have been applied to the stratification of LUAD and LUSC patients, but no study has assessed integrated training on both the clinical and the genetic data of these two lung cancer types.
Methods: We initially implemented five machine learning algorithms (Support Vector Machine, Logistic Regression, Naive Bayes, Random Forest, and K-Nearest Neighbors classifiers) on the clinical features and mutated genes of patients to develop a prognostic model that classifies LUAD and LUSC patients into high-risk and low-risk groups.
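As a rough illustration of the integrated input described above, the following minimal sketch concatenates clinical values with binary mutation indicators and applies a plain k-nearest-neighbors vote, one of the five classifier families listed. It is not the pipeline used in this study: the helper names (`build_features`, `knn_predict`), the placeholder genes `GENE_A` and `GENE_B`, the scaled clinical values, and the risk labels are all hypothetical toy assumptions.

```python
from collections import Counter
from math import dist

# Hypothetical gene panel; placeholder names, not genes reported in the study.
GENE_PANEL = ["GENE_A", "GENE_B"]


def build_features(clinical, mutated_genes, gene_panel):
    """Concatenate numeric clinical values with 0/1 mutation indicators."""
    return list(clinical) + [1.0 if g in mutated_genes else 0.0 for g in gene_panel]


def knn_predict(train_X, train_y, x, k=3):
    """Return the majority label among the k nearest training points (Euclidean)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up patients: (scaled clinical values, set of mutated genes, risk label).
patients = [
    ((0.9, 1.0), {"GENE_A"}, "high"),
    ((0.8, 0.75), {"GENE_A", "GENE_B"}, "high"),
    ((0.7, 1.0), {"GENE_B"}, "high"),
    ((0.2, 0.25), set(), "low"),
    ((0.3, 0.25), set(), "low"),
    ((0.1, 0.5), set(), "low"),
]

train_X = [build_features(c, m, GENE_PANEL) for c, m, _ in patients]
train_y = [label for _, _, label in patients]

# Classify a new (also made-up) patient from the combined feature vector.
new_patient = build_features((0.85, 0.75), {"GENE_A"}, GENE_PANEL)
prediction = knn_predict(train_X, train_y, new_patient, k=3)
print(prediction)
```

The point of the sketch is only that both data types enter the classifier as a single feature vector; in practice each of the five algorithms would be trained and compared on the same combined matrix.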
Results: We identified a list of clinical features and somatically mutated genes that may be used to evaluate the prognosis of LUAD and LUSC patients and to stratify patient risk in a clinical setting. This analysis indicates that additional genes, such as KEAP1 for LUAD and CSMD3 for LUSC, among others, can be added to clinical decision processes.
Conclusions: In current clinical practice, clinicians and patients are informed about the patient's risk group only through cancer staging. With the feature set we propose, clinicians and patients can assess the patient's risk group according to patient-specific clinical and molecular parameters. Our machine learning model may serve as a practical and reliable prognosis prediction tool for LUAD and LUSC and could provide novel insights into the underlying clinical and molecular mechanisms of these two cancer types.
Keywords: Machine Learning, Lung Adenocarcinoma, Lung Squamous Cell Carcinoma, Prognosis Prediction Model, TCGA, Multi-omics, Data Integration