Predicting lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) risk status is a crucial step in precision oncology. In current clinical practice, clinicians and patients are informed about the patient's risk group only through cancer staging. Several machine learning approaches for stratifying LUAD and LUSC patients have recently been described; however, there has yet to be a study that compares the integrated modeling of clinical and genetic data from these two lung cancer types. In our work, we used a prognostic prediction model based on clinical and somatically altered gene features from 1026 patients to assess the relevance of features based on their impact on risk classification. By integrating the clinical features and somatically mutated genes of patients, we achieved the highest accuracy: 93% for LUAD and 89% for LUSC. Our second finding is that new prognostic genes, such as KEAP1 for LUAD and CSMD3 for LUSC, and new clinical factors, such as the site of resection, are significantly associated with risk stratification and can be integrated into clinical decision making. We validated the most important features on an independent RNA-seq dataset from NCBI GEO with survival information (GSE81089) and integrated our model into a user-friendly mobile application. Using this machine learning model and mobile application, clinicians and patients can assess survival risk using each patient’s own clinical and molecular feature set.
Background: Predicting risk cohorts for lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) is a crucial step in precision oncology. Currently, clinicians and patients are informed about the patient's risk group via clinical staging. Several machine learning approaches have been applied to the stratification of LUAD and LUSC patients, but no study has assessed integrated training on both the clinical and genetic data of these two lung cancer types.
Methods: We initially implemented five machine learning algorithms (Support Vector Machine, Logistic Regression, Naive Bayes, Random Forest, and K-Nearest Neighbors classifiers) to evaluate the clinical features and mutated genes of patients and to develop a prognostic relevance model that classifies LUAD and LUSC patients into high-risk and low-risk groups.
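As an illustration of the comparison described in the Methods, the following is a minimal sketch, not the authors' pipeline. It assumes scikit-learn, a synthetic stand-in for the integrated feature matrix (continuous "clinical" columns plus random binary "mutation" flags), Gaussian Naive Bayes as the Naive Bayes variant, and 5-fold cross-validated accuracy as the metric. The names `X_clinical`, `X_mutations`, and `compare_models` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 200 patients, 5 clinical-like
# features and 20 binary mutation flags; y is 1 for high risk, 0 for low risk.
rng = np.random.default_rng(0)
X_clinical, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
# Random flags carry no signal here; they only illustrate the matrix layout.
X_mutations = rng.integers(0, 2, size=(200, 20))

# Integrate both feature types into one matrix, one row per patient.
X = np.hstack([X_clinical, X_mutations])

# The five classifier families named in the Methods; hyperparameters are
# scikit-learn defaults, not values taken from the study.
models = {
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}


def compare_models(X, y, folds=5):
    """Return the mean cross-validated accuracy for each classifier."""
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        for name, model in models.items()
    }


scores = compare_models(X, y)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

With real data, `X` would be built from the patients' clinical variables and per-gene mutation indicators, and the scores could be compared across clinical-only, mutation-only, and integrated feature sets.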
Results: We identified a list of clinical features and somatically mutated genes that may be used to evaluate the prognosis of LUAD and LUSC patients for risk stratification in a clinical setting. This analysis suggests that new genes, such as KEAP1 for LUAD and CSMD3 for LUSC, among others, can be added to clinical decision processes.
Conclusions: In current clinical practice, clinicians and patients are informed about the patient's risk group only through cancer staging. With the feature set we propose, clinicians and patients can assess the patient's risk group according to patient-specific clinical and molecular parameters. Our machine learning model may serve as a practical and reliable prognostic tool for LUAD and LUSC and could provide novel insights into the underlying clinical and molecular mechanisms of LUAD and LUSC.
Keywords: Machine Learning, Lung Adenocarcinoma, Lung Squamous Cell Carcinoma, Prognosis Prediction Model, TCGA, Multi-omics, Data Integration