A biomedically explainable and validatable text mining pipeline can aid cancer gene panel discovery. We create a pipeline that contextualizes genes using text-mined co-occurrence features, applying Biomedical Natural Language Processing (BioNLP) techniques to mine the literature on cancer gene panels. We built a literature-derived 4,679 × 4,630 gene term-feature matrix. The EGFR L858R and T790M variants and the BRAF V600E variant are important mutation term features in text mining and are frequently mutated in cancer. We validate the cancer gene panel against the mutational landscape of different cancer types. The cosine similarity of gene frequency between the text mining result and a statistical result from clinical sequencing data is 80.8%. Across different machine learning models, the best accuracy for predicting two gene panels, MSK-IMPACT (Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets) and the Oncomine cancer gene panel, is 0.959 and 0.989, respectively. Receiver operating characteristic (ROC) curve analysis confirmed that the neural net model has better prediction performance (area under the ROC curve (AUC) = 0.992). Text-mined co-occurrence features can thus contextualize each gene. We believe the approach can be used to evaluate several existing gene panels, and we show that part of a gene panel set can be used to predict the remaining genes for cancer discovery.
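As a minimal sketch of the cosine similarity comparison mentioned in the abstract above: the function below compares two per-gene frequency profiles, one from text mining and one from sequencing data. The function name, the dictionary representation, and the toy counts are all hypothetical illustrations, not the authors' implementation or data.

```python
import math


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two gene -> frequency mappings.

    Genes missing from one mapping are treated as frequency 0
    (an assumption made for this sketch).
    """
    genes = sorted(set(freq_a) | set(freq_b))
    a = [freq_a.get(g, 0.0) for g in genes]
    b = [freq_b.get(g, 0.0) for g in genes]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for an all-zero profile
    return dot / (norm_a * norm_b)


# Hypothetical toy counts, for illustration only.
text_mined = {"EGFR": 120, "BRAF": 80, "TP53": 200}
sequenced = {"EGFR": 30, "BRAF": 25, "TP53": 60, "KRAS": 40}
similarity = cosine_similarity(text_mined, sequenced)
```

Because cosine similarity depends only on the direction of the two vectors, raw literature counts and clinical mutation frequencies can be compared without first rescaling them to a common range.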
Background: Functional disruptions by large germline genomic structural variants in susceptible genes are known risks for cancer. Few studies have used deletion structural variants (DSVs) to predict cancer risk with neural networks or studied the relationship between DSVs and immune gene expression to stratify prognosis. Methods: Whole-genome sequencing (WGS) data from the blood samples of 192 cancer and 499 noncancer subjects with or without family cancer history (FCH) were analyzed. Ninety-nine colorectal cancer (CRC) patients had immune response gene expression data. To build the cancer risk predictive model and identify DSVs in familial cancer, we used joint calling tools and an attention-weighted model. The survival support vector machine (survival-SVM) was used to select prognostic DSVs. Results: We identified 671 DSVs that could predict cancer risk. The area under the receiver operating characteristic (ROC) curve (AUC) of the attention-weighted model was 0.71. The 3 most frequent DSV genes observed in cancer patients were ADCY9, AURKAPS1, and RAB3GAP2 (P < 0.05). We identified 65 immune-associated DSV markers for assessing cancer prognosis (P < 0.05). The functional protein of the MUC4 DSV gene interacted with MAGE1 expression, according to the STRING database. The causal inference model showed that deleting the CEP72 DSV gene could affect the recurrence-free survival (RFS) of IFIT1 expression. Conclusions: We established an explainable attention-weighted model for cancer risk prediction and used the survival-SVM for prognostic stratification based on DSV and immune gene expression datasets. The approach can provide the genetic landscape of cancer patients and help predict clinical outcomes.