Motivation: The biological effects of human missense variants have been studied experimentally for decades, but predicting their effects in clinical molecular diagnostics remains challenging. Available computational tools are usually based on the analysis of sequence conservation and structural properties of the mutant protein. We recently introduced a new machine learning method that demonstrated for the first time the significance of protein dynamics in determining the pathogenicity of missense variants. Results: Here, we present a new interface (Rhapsody) that enables fully automated assessment of pathogenicity, incorporating both sequence coevolution data and structure- and dynamics-based features. Benchmarked against a dataset of about 20 000 annotated variants, the methodology is shown to outperform well-established and/or advanced prediction tools. We illustrate the utility of Rhapsody by in silico saturation mutagenesis studies of human H-Ras, phosphatase and tensin homolog and thiopurine S-methyltransferase. Availability and implementation: The new tool is available both as an online webserver at http://rhapsody.csb.pitt.edu and as an open-source Python package (GitHub repository: https://github.com/prody/rhapsody; PyPI package installation: pip install prody-rhapsody). Links to additional resources, tutorials and package documentation are provided in the 'Python package' section of the website. Supplementary information: Supplementary data are available at Bioinformatics online.
Multiple myeloma is a treatable, but currently incurable, hematological malignancy of plasma cells characterized by diverse and complex tumor genetics for which precision medicine approaches to treatment are lacking. The MMRF CoMMpass study is a longitudinal, observational clinical study of newly diagnosed multiple myeloma patients where tumor samples are characterized using whole genome, exome, and RNA sequencing at diagnosis and progression, and clinical data is collected every three months. Analyses of the baseline cohort identified genes that are the target of recurrent gain- and loss-of-function events. Consensus clustering identified 8 and 12 unique copy number and expression subtypes of myeloma, respectively, identifying high-risk genetic subtypes and elucidating many of the molecular underpinnings of these unique biological groups. Analysis of serial samples showed 25.5% of patients transition to a high-risk expression subtype at progression. We observed robust expression of immunotherapy targets in this subtype, suggesting a potential therapeutic option.
Multiple myeloma (MM) is a genetically heterogeneous disease of plasma cells that generally exhibits chromosomal abnormalities and distinct gene expression signatures. Previous studies have used microarray technology to derive gene expression indices, identifying genes associated with survival outcomes in order to predict whether a newly diagnosed patient has an aggressive form of the disease. One such MM-specific index is the UAMS 70 gene index, which is composed of 51 over- and 19 under-expressed genes. This index was developed using Affymetrix U133Plus2.0 microarray data from 532 MM patients at diagnosis by computing log-rank test statistics on gene expression quartiles. Although the index consistently achieves high performance across a variety of MM datasets, issues arise when it is applied to RNAseq data. Here we address those issues, deriving an independent index based on the RNAseq data from the Multiple Myeloma Research Foundation (MMRF) CoMMpass Study (NCT01454297), and benchmark its performance against an implementation of the UAMS 70 gene index. UAMS index scores are computed by taking the difference between the average log2-scale expression of the 51 over-expressed genes and that of the 19 under-expressed genes. We applied this calculation to RNAseq data analyzed using Sailfish, Salmon v7.2, and HTseq counts collected from 41 Multiple Myeloma Genomics Initiative samples and compared the results to scores from matching GCRMA, MAS5, RMA, and PLIER16 Affymetrix U133Plus2.0 microarray data. Differences in the distribution of index values across data types led to inconsistent classification of high-risk individuals. Additionally, when applied to RNAseq data, several Affymetrix probesets did not uniquely match gene annotations from Ensembl-v74. This reduced the number of genes upon which our UAMS score was calculated to 61 genes.
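The score calculation described above (mean log2-scale expression of the over-expressed genes minus mean log2-scale expression of the under-expressed genes) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the placeholder gene identifiers, and the choice to skip genes that are absent from the input (loosely mirroring the reduction from 70 to 61 usable genes) are all hypothetical.

```python
from statistics import fmean


def index_score(log2_expr, over_genes, under_genes):
    """Difference of mean log2 expression: over-expressed minus under-expressed.

    log2_expr   : dict mapping gene identifier -> log2-scale expression value
    over_genes  : identifiers of the over-expressed genes in the index
    under_genes : identifiers of the under-expressed genes in the index

    Genes missing from log2_expr are skipped (an assumption made here to
    mimic probes that could not be mapped to a unique gene).
    """
    over = [log2_expr[g] for g in over_genes if g in log2_expr]
    under = [log2_expr[g] for g in under_genes if g in log2_expr]
    if not over or not under:
        raise ValueError("need at least one measured gene in each group")
    return fmean(over) - fmean(under)


# Toy sample with placeholder gene names (not genes from the real index).
sample = {"UP1": 8.0, "UP2": 6.0, "DOWN1": 3.0}
score = index_score(sample, ["UP1", "UP2", "UP_UNMAPPED"], ["DOWN1"])
print(score)
```

Because the score is a difference of group means, it shifts whenever the quantification method changes the scale or offset of one group relative to the other, which is one way the cross-platform discrepancies noted above could arise.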
Of the original 51 over-expressed probes, only 44 uniquely mapped genes remained after 7 multi-mapped probes were removed; similarly, of the 19 under-expressed genes, only 17 were uniquely mapped. Given the complication of probe-gene mismatch and the inconsistencies in identifying high-risk individuals when the index is applied to RNAseq data, we developed an independent index using the baseline RNAseq data from the MMRF CoMMpass Study IA13 dataset. From a training set (n=375) of RNAseq data measuring 56430 genes, we performed univariate log-rank tests on expression quartiles associated with disease-related survival while controlling for an FDR of 2.5%, resulting in 23 under- and 332 over-expressed genes. Subsequent multivariate Cox regression analysis and backward stepwise selection culminated in the identification of the CoMMpass RNAseq index, which is based on the ratio of mean expression values of 87 genes (19 under- and 68 over-expressed) predictive of high risk (hazard ratio [HR] = 8.7341, 95% CI = 5.615-13.58, p < 0.001). Validation on the test set (n=251) yielded a HR of 5.612 (95% CI = 3.066-10.27, p < 0.001) as compared to a HR of 4.753 (95% CI = 2.688-8.403, p < 0.001) achieved with the adapted UAMS index. Adjusting for a patient's International Staging System (ISS) stage revises these hazard ratios to 6.236 (95% CI = 3.345-11.627, p < 0.001) and 3.6420 (95% CI = 1.9726-6.724, p < 0.001) for the CoMMpass RNAseq and the adapted UAMS indices, respectively. Furthermore, the distribution of CoMMpass RNAseq index values across the training and test sets shows no observable bias with respect to the three main therapy arms, suggesting it is predictive of high risk independent of treatment. Our newly derived CoMMpass RNAseq index shares one gene with the UAMS 61 gene index (CENPW) and recovers two over-expressed genes (FABP5, TAGLN2) that were removed from the UAMS 70 gene index due to probe multimapping.
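The gene-screening step above controls the false discovery rate at 2.5% across many univariate tests. The abstract does not say which FDR procedure was used; the sketch below assumes the Benjamini-Hochberg step-up procedure purely for illustration, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.025):
    """Return a list of booleans: True where the test is declared significant.

    Assumed procedure (Benjamini-Hochberg): sort the m p-values, find the
    largest rank k with p_(k) <= (k / m) * fdr, and reject the k smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical per-gene log-rank p-values (illustrative only).
p_vals = [0.001, 0.04, 0.02, 0.5]
print(benjamini_hochberg(p_vals, fdr=0.025))
print(benjamini_hochberg(p_vals, fdr=0.05))
```

In a screen of this kind, each entry of the input would be the log-rank p-value for one gene, and only genes flagged True would proceed to the multivariate Cox regression step.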
When the recovered genes are added back to the UAMS index, the unadjusted and adjusted hazard ratios measured for the test set are 5.173 (CI = 2.926-9.146, p < 0.001) and 4.022 (CI = 2.1840-7.408, p < 0.001), respectively. Of the original 70 genes in the UAMS index, 21 (30%) map to chromosome 1, which frequently exhibits copy number gains in MM. Only 11 of the 87 (13%) genes in our proposed index map to chr1; given the index's performance, this suggests that the newly derived gene list may be a more diverse predictor of high-risk MM and may provide novel insights into it. Altogether, the CoMMpass RNAseq index identifies a high-risk signature in 13% of MM patients and outperforms the UAMS index. Disclosures Lonial: Amgen: Research Funding.
Plasma cell leukemia (PCL) is a rare but aggressive, advanced form of multiple myeloma in which neoplastic plasma cells (PCs) lose dependence on the bone marrow (BM) and circulate in the peripheral blood (PB). PCL is clinically defined by diagnosis of myeloma with ≥20% circulating plasma cells (CPCs); however, several groups have proposed a ≥5% CPC cutoff. PCL is classified as primary (pPCL) if it presents at myeloma diagnosis or secondary (sPCL) if it arises at a later progression event. These presentations of PCL are clinically distinct, with sPCL patients responding poorly to novel therapies and having inferior outcomes compared to pPCL patients. Despite recent advances in myeloma therapy, PCL prognosis remains poor, and the molecular drivers of PCL remain poorly understood. The MMRF CoMMpass study (NCT01454297) is a longitudinal, observational clinical study of 1171 newly diagnosed myeloma patients. Tumors were characterized using whole genome (WGS), exome (WES), and RNA (RNAseq) sequencing at diagnosis and each progression event. PCs were isolated from BM-derived tumors and, when >5% CPCs were detected, PCs were also isolated from the PB, creating a subcohort of patients with sequencing data from both the BM and PB compartments, with some patients assayed longitudinally. The percent CPCs determined using flow cytometry was reported for 982 patients at myeloma diagnosis and 194 patients at progression. Patients with 5-20% CPCs (median = 19 months) at diagnosis had poor overall survival (OS) outcomes compared to those with less than 5% CPCs (median = 95 months, p<0.001). No outcome difference was observed between patients with 5-20% and >20% CPCs (median = 41 months), confirming the findings of previous independent studies. A ≥5% CPC cutoff identified 947 myeloma, 29 pPCL, and 6 sPCL patients in the CoMMpass cohort.
Compared to myeloma, pPCL and sPCL patients had poor OS (p<0.001), and after sPCL detection, patients had a median OS of only 53 days (range = 0-169 days). For 10 pPCL patients, the percent CPCs was reported at diagnosis and at least one progression event, and patients with persistent CPCs (n = 5, median = 16 months, p<0.01) had poor OS compared to patients with no detectable CPCs at progression (n = 5, median not met, median follow up = 64 months). This underscores the benefit of early eradication of CPCs and repeated CPC measurements in pPCL. The proliferative (PR) gene expression subtype of myeloma has been previously described and defines a high-risk group of patients with diverse genetic backgrounds and inferior outcomes. For PCL patients, we determined the subtype of all BM and PB tumor samples characterized using RNAseq. There was high subtype concordance between paired BM and PB tumor samples (12/13, 92.3%). Overall, 6/23 (26.1%) pPCL patients were in the PR subtype, and PR pPCL patients had poor OS outcomes (median = 10 months, p<0.001) compared to non-PR pPCL patients (median = 55 months). PR emerged as a robust predictor of risk in pPCL, outperforming other molecular and clinical variables including high BM or PB PCs, plasmacytomas, renal failure, high LDH, high B2M, low platelets, t(11;14), del(1p), amp(1q), del(13q), and del(17p), suggesting that RNA subtyping of CPCs may represent a non-invasive tool to predict risk in pPCL. At myeloma diagnosis, all sPCL patients with RNAseq data were classified in non-PR subtypes. However, at sPCL, 5/6 (83.3%) patients were in the PR subtype, indicating that sPCL is associated with transition to PR. Two sPCL patients who transitioned to PR acquired biallelic deletion of CDKN2C, and a third acquired biallelic deletion of RB1.
Overall, a subset of pPCL (26.1%) but the majority of sPCL (83.3%) patients were in the PR subtype at PCL diagnosis, providing a molecular basis for the different clinical presentations observed between these two groups, including the highly aggressive nature of sPCL. In summary, this study supports using a lower percent CPC cutoff to clinically define PCL and highlights the importance of repeated CPC measurements in prognosticating pPCL patients. Further, the PR RNA subtype emerged as a predictor of risk in pPCL and, given that the majority of sPCL patients were in the PR subtype, provides a molecular basis for the different clinical features observed between pPCL and sPCL patients. Disclosures Mikhael: Amgen: Consultancy; Takeda: Consultancy; BMS: Consultancy; Janssen: Consultancy; Karyopharm: Consultancy; Sanofi: Consultancy; GSK: Consultancy; Oncopeptides: Consultancy.
The immense size of chemical space, the relative scarcity of high-quality data, and the cost of running experiments to accurately measure molecular properties make active learning (AL) an attractive approach to efficiently explore the space and train high-quality models for molecular property prediction. While AL has traditionally been successful at classification, there have been recent advances in using AL for regression tasks. Recently, regressing to a normal inverse gamma distribution has been shown to be effective at predicting molecular properties in the QM9 dataset. However, we present a series of experiments demonstrating that various state-of-the-art AL regression techniques are indistinguishable from random selection for small molecule pKa prediction. Source code for this paper is available at https://github.com/francoep/pKa_activelearning.
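To make the comparison between uncertainty-driven and random selection concrete, the sketch below shows one acquisition step. It assumes a model that outputs normal inverse gamma parameters (nu, alpha, beta) per molecule and uses the usual evidential-regression variance decomposition, aleatoric = beta / (alpha - 1) and epistemic = beta / (nu * (alpha - 1)). The function names, the pool layout, and the parameter values are hypothetical and are not taken from the linked repository.

```python
import random


def nig_uncertainties(nu, alpha, beta):
    """Aleatoric and epistemic variance from normal inverse gamma parameters.

    Assumes the standard evidential-regression decomposition; both
    quantities are only defined for alpha > 1 and nu > 0.
    """
    if alpha <= 1 or nu <= 0:
        raise ValueError("requires alpha > 1 and nu > 0")
    aleatoric = beta / (alpha - 1)
    epistemic = beta / (nu * (alpha - 1))
    return aleatoric, epistemic


def select_batch(pool, k, strategy="uncertainty", seed=0):
    """Pick k molecule ids from pool, a list of (mol_id, nu, alpha, beta).

    "uncertainty" ranks by epistemic variance (highest first);
    "random" is the baseline that ignores the model entirely.
    """
    if strategy == "random":
        rng = random.Random(seed)
        return [entry[0] for entry in rng.sample(pool, k)]
    ranked = sorted(pool, key=lambda e: nig_uncertainties(*e[1:])[1], reverse=True)
    return [entry[0] for entry in ranked[:k]]


# Hypothetical unlabeled pool with made-up model outputs.
pool = [("mol_a", 1.0, 2.0, 1.0), ("mol_b", 0.5, 2.0, 1.0), ("mol_c", 2.0, 3.0, 1.0)]
print(select_batch(pool, 2, strategy="uncertainty"))
print(select_batch(pool, 2, strategy="random"))
```

A study of the kind described would repeat this step over many rounds, retraining after each batch is labeled, and compare the learning curves of the two strategies.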