Predicting the difference in thermodynamic stability between protein variants is crucial for protein design and for understanding genotype-phenotype relationships. Several computational tools have been created to address this task. Nevertheless, most of them have been trained or optimized on the same data, which comprise essentially all the data available, making a fair comparison unfeasible. Here, we introduce a novel dataset, collected and manually cleaned from the latest version of the ThermoMutDB database, consisting of 669 variants not included in the most widely used training datasets. The prediction performance and the ability to satisfy the antisymmetry property, assessed by considering both direct and reverse variants, were evaluated across 21 different tools. The Pearson correlations of the tested tools were in the ranges of 0.21–0.5 and 0–0.45 for the direct and reverse variants, respectively. When both direct and reverse variants are considered, the antisymmetric methods perform better, achieving a Pearson correlation in the range of 0.51–0.62. The tested methods seem relatively insensitive to the physiological conditions, also performing well on the variants measured at more extreme pH and temperature values. A common issue with all the tested methods is the compression of the $\Delta \Delta G$ predictions toward zero. Furthermore, the thermodynamic stability of the most significantly stabilizing variants was found to be more challenging to predict. This study is the most extensive comparison of prediction methods using an entirely novel set of variants never tested before.
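As a rough illustration of the kind of evaluation described above, the following Python sketch computes a Pearson correlation and two simple antisymmetry summaries: the correlation between direct and reverse predictions, and the mean of their half-sum. The function names and the numbers are hypothetical and are not taken from the study; the sketch assumes that the experimental value of a reverse variant is the negative of the direct one.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def antisymmetry_report(pred_direct, pred_reverse):
    """Return (r_dr, bias) for paired direct and reverse predictions.

    r_dr is the correlation between direct and reverse predictions and
    bias is the mean of (direct + reverse) / 2. A perfectly antisymmetric
    predictor gives r_dr = -1 and bias = 0.
    """
    r_dr = pearson(pred_direct, pred_reverse)
    n = len(pred_direct)
    bias = sum(d + r for d, r in zip(pred_direct, pred_reverse)) / (2 * n)
    return r_dr, bias


# Hypothetical toy values in kcal/mol, for illustration only.
experimental_direct = [-1.2, 0.4, -2.5, 0.9]
experimental_reverse = [-v for v in experimental_direct]  # assumed antisymmetry
pred_direct = [-0.6, 0.1, -1.0, 0.3]
pred_reverse = [0.6, -0.1, 1.0, -0.3]

r_direct = pearson(experimental_direct, pred_direct)
r_reverse = pearson(experimental_reverse, pred_reverse)
r_dr, bias = antisymmetry_report(pred_direct, pred_reverse)
print(r_direct, r_reverse, r_dr, bias)
```

In this toy case the predictions are smaller in magnitude than the experimental values, which mimics the compression toward zero mentioned above while leaving the correlation high.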
The prediction of free energy changes upon protein residue variations is an important application in biophysics and biomedicine. Several methods have been developed to address this problem so far, including physics-based and machine learning models. However, most of the current computational tools, especially data-driven approaches, fail to incorporate the basic thermodynamic principle of antisymmetry: a variation from the wild-type to a mutated form of the protein structure ($X_W \to X_M$) and its reverse process ($X_M \to X_W$) must have opposite values of the free energy difference: $\Delta \Delta G_{WM} = -\Delta \Delta G_{MW}$. Here, we build a deep neural network system that, by construction, satisfies the antisymmetry property. We show that the new method (ACDC-NN) achieves comparable or better performance with respect to other state-of-the-art approaches on both direct and reverse variations, making this method suitable for scoring new protein variants while preserving the antisymmetry. The code is available at: https://github.com/compbiomed-unito/acdc-nn.
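To make the idea of antisymmetry "by construction" concrete, here is a minimal Python sketch of one generic way to obtain it: wrap any pairwise scorer and take half the difference between its output and the output with swapped inputs. This is not the ACDC-NN architecture; `antisymmetrize`, `toy_score` and the coefficients are hypothetical names and values chosen only for illustration.

```python
def antisymmetrize(score):
    """Wrap a pairwise scorer so that swapping its inputs flips the sign."""
    def ddg(wild_type, mutant):
        return 0.5 * (score(wild_type, mutant) - score(mutant, wild_type))
    return ddg


# Kyte-Doolittle hydropathy values, used here only as toy features.
HYDROPATHY = {"A": 1.8, "G": -0.4, "L": 3.8, "D": -3.5}


def toy_score(wild_type, mutant):
    """Hypothetical scorer that is NOT antisymmetric on its own."""
    return 0.3 * HYDROPATHY[mutant] - 0.1 * HYDROPATHY[wild_type] + 0.05


predict_ddg = antisymmetrize(toy_score)

direct = predict_ddg("L", "D")    # wild-type L replaced by D
reverse = predict_ddg("D", "L")   # the reverse variation
print(direct, reverse)            # equal magnitude, opposite sign
```

Whatever the wrapped scorer does, the wrapper guarantees that the direct and reverse predictions sum to zero and that a residue "mutated" to itself scores exactly zero.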
Estimating the functional effect of single amino acid variants in proteins is fundamental for predicting the change in thermodynamic stability, measured as the difference in the Gibbs free energy of unfolding between the wild-type and the variant protein (ΔΔG). Here, we present the web server of the DDGun method, which was previously developed for ΔΔG prediction upon amino acid variants. DDGun is an untrained method based on basic features derived from evolutionary information. It is antisymmetric, as it predicts opposite ΔΔG values for direct (A → B) and reverse (B → A) single and multiple site variants. DDGun is available in two versions, one based only on sequence information and the other based on sequence and structure information. Despite being untrained, DDGun reaches a prediction performance comparable to that of trained methods. For the web server version, we updated the protein sequence database used for the computation of the evolutionary features, and we compiled two new data sets of protein variants to perform a blind test of its performance. On these blind data sets of single and multiple site variants, DDGun confirms its prediction performance, reaching an average correlation coefficient between experimental and predicted ΔΔG of 0.45 and 0.49 for the sequence-based and structure-based versions, respectively. Besides being used for the prediction of ΔΔG, we suggest that DDGun should be adopted as a benchmark method to assess the predictive capabilities of newly developed methods. Releasing DDGun as a web server, stand-alone program and docker image will facilitate the necessary process of method comparison to improve ΔΔG prediction.
Several studies have linked disruptions of protein stability and its normal functions to disease. Therefore, during the last few decades, many tools have been developed to predict the free energy changes upon protein residue variations. Most of these methods require both sequence and structure information to obtain reliable predictions. However, the lower number of protein structures available with respect to their sequences, due to experimental issues, drastically limits the application of these tools. In addition, current methodologies ignore the antisymmetric property characterizing the thermodynamics of the protein stability: a variation from wild-type to a mutated form of the protein structure ($X_W \to X_M$) and its reverse process ($X_M \to X_W$) must have opposite values of the free energy difference ($\Delta \Delta G_{WM} = -\Delta \Delta G_{MW}$). Here we propose ACDC-NN-Seq, a deep neural network system that exploits the sequence information and is able to incorporate into its architecture the antisymmetry property. To our knowledge, this is the first convolutional neural network to predict protein stability changes relying solely on the protein sequence. We show that ACDC-NN-Seq compares favorably with the existing sequence-based methods.
Amyotrophic lateral sclerosis (ALS) is a highly complex and heterogeneous neurodegenerative disease that affects motor neurons. Since life expectancy is relatively low, it is essential to promptly understand the course of the disease to better target the patient’s treatment. Predictive models for disease progression are thus of great interest. One of the most extensive and well-studied open-access data resources for ALS is the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) repository. In 2015, the DREAM-Phil Bowen ALS Prediction Prize4Life Challenge was held on PRO-ACT data, where competitors were asked to develop machine learning algorithms to predict disease progression measured through the slope of the ALSFRS score between 3 and 12 months. However, although deep learning approaches have already been successfully applied in several studies on ALS patients, to the best of our knowledge they remain unexplored for ALSFRS slope prediction in the PRO-ACT cohort. Here, we investigate how deep learning models perform in predicting ALS progression using the PRO-ACT data. We developed three models based on different architectures that showed comparable or better performance with respect to the state-of-the-art models, thus representing a valid alternative for predicting ALS disease progression.
Aims: Takotsubo syndrome (TTS) is associated with a substantial rate of adverse events. We sought to design a machine-learning (ML) based model to predict the risk of in-hospital death and to perform a clustering of TTS patients to identify different risk profiles. Methods and results: A Ridge Logistic Regression-based ML model for predicting in-hospital death was developed on 3482 TTS patients from the International Takotsubo Registry, randomly split into a training and an internal validation cohort (75% and 25% of the sample size, respectively) and evaluated in an external validation cohort (1037 patients). Thirty-one clinically relevant variables were included in the prediction model. Model performance represented the primary endpoint and was assessed according to area under the receiver-operating characteristic curve (AUC), sensitivity and specificity. As a secondary endpoint, a K-Medoids clustering algorithm was designed to stratify patients into phenotypic groups based on the ten most relevant features emerging from the main model. The overall incidence of in-hospital death was 5.2%. The InterTAK-ML model showed an AUC of 0.89 (0.85-0.92), a sensitivity of 0.85 (0.78-0.95) and a specificity of 0.76 (0.74-0.79) in the internal validation cohort and an AUC of 0.82 (0.73-0.91), a sensitivity of 0.74 (0.61-0.87) and a specificity of 0.79 (0.77-0.81) in the external cohort for in-hospital death prediction. By exploiting the 10 variables showing the highest feature importance, TTS patients were clustered into six groups associated with different risks of in-hospital death (28.8% vs 15.5% vs 5.4% vs 0.8% vs 0.5%), which were also consistent in the external cohort. Conclusion: An ML-based approach for the identification of TTS patients at risk of adverse short-term prognosis is feasible and effective. The InterTAK-ML model showed unprecedented discriminative capability for the prediction of in-hospital death.
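For readers less familiar with the three metrics reported above, the following Python sketch shows how AUC, sensitivity and specificity can be computed from binary outcomes and predicted risks. The data, the 0.5 threshold and the function names are hypothetical; this is not the InterTAK-ML pipeline, and the pairwise AUC below is a simple rank-based formulation suited only to small examples.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Ties count as one half. The cost is quadratic in the sample size.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = in-hospital death) and predicted risks.
y_true = [1, 1, 0, 0, 0, 1]
risk = [0.9, 0.6, 0.4, 0.7, 0.1, 0.8]

threshold = 0.5  # assumed cut-off, for illustration only
y_pred = [1 if r >= threshold else 0 for r in risk]

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, risk)
print(sens, spec, area)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes the ranking of patients across all thresholds.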
The high cosine similarity between some single-base substitution mutational signatures and their characteristic flat profiles could suggest the presence of overfitting and mathematical artefacts. The newest version (v3.3) of the signature database available in the Catalogue Of Somatic Mutations In Cancer (COSMIC) provides a collection of 79 mutational signatures, which has more than doubled with respect to the previous version (30 profiles available in COSMIC signatures v2), making the associations between signatures and specific mutagenic processes more critical. This study provides a systematic assessment of the de novo extraction task through simulation scenarios based on the latest version of the COSMIC signatures and highlights, through a novel approach using archetypal analysis, which COSMIC signatures are redundant and more likely to be considered mathematical artefacts. Twenty-nine archetypes were able to reconstruct the profiles of all the COSMIC signatures with cosine similarity >0.8. Interestingly, these archetypes tend to group similar original signatures sharing either the same aetiology or similar biological processes. We believe that these findings will be useful to encourage the development of new de novo extraction methods that avoid the redundancy of information among the signatures while preserving the biological interpretation.
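The cosine similarity used above to compare signature profiles can be sketched in a few lines of Python. The 6-channel vectors below are hypothetical toy profiles (real single-base substitution signatures have 96 channels), chosen to show why two nearly flat profiles score as highly similar even when they differ in every channel.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Hypothetical toy profiles; each sums to 1, like a signature.
flat_a = [1 / 6] * 6
flat_b = [0.20, 0.15, 0.17, 0.16, 0.18, 0.14]
peaked = [0.90, 0.02, 0.02, 0.02, 0.02, 0.02]

sim_flat = cosine_similarity(flat_a, flat_b)
sim_peaked = cosine_similarity(flat_a, peaked)
print(sim_flat, sim_peaked)
```

Here the two flat profiles exceed the 0.8 threshold mentioned above, while the flat and peaked profiles do not.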
Background: Takotsubo syndrome (TTS) carries a non-negligible rate of impaired short-term prognosis. Existing models, based on classical statistical methods, have shown only moderate accuracy in predicting the risk of in-hospital adverse events following admission for TTS. We sought to design a machine-learning (ML) based model to predict the risk of in-hospital death among patients admitted for TTS, and to provide clusters of TTS patients associated with different risks of adverse short-term prognosis. Methods: A Penalized Logistic Regression-based ML model for predicting in-hospital death was trained and tested on a cohort of 3482 patients with TTS from the international, multicenter InterTAK Registry. Thirty-three clinically relevant variables were selected for inclusion in the prediction model. Model performance was assessed according to area under the receiver operating characteristic curve (AUC). A K-Means clustering algorithm was designed to stratify patients into phenotypic groups based on the most relevant features emerging from the main model. Results: The overall incidence of in-hospital death was 5.2%. The InterTAK-ML model showed an AUC of 0.88 (95% CI 0.87-0.90) and 0.87 (95% CI 0.83-0.91) for in-hospital death prediction in the training and test cohorts, respectively. By exploiting the 5 variables showing the highest feature importance (use of catecholamines, type of triggering factor, left ventricular ejection fraction, white blood cell count, heart rate), TTS patients were clustered into five groups associated with different risks of in-hospital death (29.4% vs 3.9% vs 1.6% vs 1.3% vs 0.7%). Conclusion: An ML-based approach for the identification of TTS patients at risk of adverse short-term prognosis is feasible and effective. The InterTAK-ML model showed accurate discriminative capability for the prediction of in-hospital death.
To support clinical decision-making, TTS patients can be clustered into groups entailing different risks of death based on routinely collected variables.