Protein sequences contain rich information about protein evolution, fitness landscapes, and stability. Here we investigate how latent space models trained using variational auto-encoders can infer these properties from sequences. Using both simulated and real sequences, we show that the low dimensional latent space representation of sequences, calculated using the encoder model, captures both evolutionary and ancestral relationships between sequences. Together with experimental fitness data and Gaussian process regression, the latent space representation also enables learning the protein fitness landscape in a continuous low dimensional space. Moreover, the model is also useful in predicting protein mutational stability landscapes and quantifying the importance of stability in shaping protein evolution. Overall, we illustrate that the latent space models learned using variational auto-encoders provide a mechanism for exploration of the rich data contained in protein sequences regarding evolution, fitness and stability and hence are well-suited to help guide protein engineering efforts.
Recently, several experimental techniques have emerged for probing RNA structures based on high-throughput sequencing. However, most secondary structure prediction tools that incorporate probing data are designed and optimized for particular types of experiments. For example, RNAstructure-Fold is optimized for SHAPE data, while SeqFold is optimized for PARS data. Here, we report a new RNA secondary structure prediction method, restrained MaxExpect (RME), which can incorporate multiple types of experimental probing data and is based on a free energy model and an MEA (maximizing expected accuracy) algorithm. We first demonstrated that RME substantially improved secondary structure prediction with perfect restraints (base pair information of known structures). Next, we collected structure-probing data from diverse experiments (e.g. SHAPE, PARS and DMS-seq) and transformed them into a unified set of pairing probabilities with a posterior probabilistic model. By using the probability scores as restraints in RME, we compared its secondary structure prediction performance with two other well-known tools, RNAstructure-Fold (based on a free energy minimization algorithm) and SeqFold (based on a sampling algorithm). For SHAPE data, RME and RNAstructure-Fold performed better than SeqFold, because they markedly altered the energy model with the experimental restraints. For high-throughput data (e.g. PARS and DMS-seq) with lower probing efficiency, the secondary structure prediction performances of the tested tools were comparable, with performance improvements for only a portion of the tested RNAs. However, when the effects of tertiary structure and protein interactions were removed, RME showed the highest prediction accuracy in the DMS-accessible regions by incorporating in vivo DMS-seq data.
The multistate Bennett acceptance ratio method (MBAR) and unbinned weighted histogram analysis method (UWHAM) are widely employed approaches to calculate relative free energies of multiple thermodynamic states that gain statistical precision by employing free energy contributions from configurations sampled at each of the simulated λ states. With the increasing availability of high-throughput computing resources, a large number of configurations can be sampled from hundreds or even thousands of states. Combining sampled configurations from all states to calculate relative free energies requires the iterative solution of large-scale MBAR/UWHAM equations. In the current work, we describe the development of a fast solver to iteratively solve these large-scale MBAR/UWHAM equations, utilizing our previous finding that the MBAR/UWHAM equations can be derived as a Rao-Blackwell estimator. The solver is implemented and distributed as a Python module called FastMBAR. Our benchmark results show that FastMBAR is more than 2 times faster than the widely used solver pymbar when it runs on a central processing unit (CPU) and more than 100 times faster than pymbar when it runs on a graphical processing unit (GPU). The significant speedup achieved by FastMBAR running on a GPU is useful not only for solving large-scale MBAR/UWHAM equations but also for estimating the uncertainty of calculated free energies using bootstrapping, where the MBAR/UWHAM equations need to be solved multiple times.
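To make the phrase "iterative solution of MBAR/UWHAM equations" concrete, here is a minimal sketch of the standard self-consistent iteration in reduced units (kT = 1), using only the Python standard library. It is not FastMBAR's solver and does not use the FastMBAR or pymbar APIs. The function names and the harmonic test system are assumptions chosen for illustration.

```python
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def solve_mbar(u, counts, max_iter=1000, tol=1e-8):
    """Plain self-consistent iteration of the MBAR/UWHAM equations.

    u[k][n] is the reduced energy of pooled sample n evaluated in state k;
    counts[k] is the number of samples drawn from state k (assumed > 0).
    Returns reduced free energies relative to state 0.
    """
    n_states, n_samples = len(u), len(u[0])
    log_counts = [math.log(c) for c in counts]
    f = [0.0] * n_states
    for _ in range(max_iter):
        # log of the mixture denominator, sum_k N_k exp(f_k - u_k(x_n))
        log_denom = [
            logsumexp([log_counts[k] + f[k] - u[k][n] for k in range(n_states)])
            for n in range(n_samples)
        ]
        new_f = [
            -logsumexp([-u[i][n] - log_denom[n] for n in range(n_samples)])
            for i in range(n_states)
        ]
        new_f = [x - new_f[0] for x in new_f]  # fix the arbitrary constant
        change = max(abs(a - b) for a, b in zip(new_f, f))
        f = new_f
        if change < tol:
            break
    return f


def demo_harmonic_mbar(spring_constants=(1.0, 2.0, 4.0), n_per_state=1000, seed=0):
    """Hypothetical test system: 1D harmonic wells u_k(x) = 0.5 * k * x^2."""
    rng = random.Random(seed)
    samples = [
        rng.gauss(0.0, 1.0 / math.sqrt(k))
        for k in spring_constants
        for _ in range(n_per_state)
    ]
    u = [[0.5 * k * x * x for x in samples] for k in spring_constants]
    counts = [n_per_state] * len(spring_constants)
    estimate = solve_mbar(u, counts)
    # Analytic result: f_k - f_0 = 0.5 * ln(k_k / k_0)
    exact = [0.5 * math.log(k / spring_constants[0]) for k in spring_constants]
    return estimate, exact
```

Calling `demo_harmonic_mbar()` returns the estimated and analytic free energies side by side. The cost of each iteration grows with the number of states times the number of pooled samples, which is the scaling that motivates a faster solver for large problems.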
Fast Fourier transform (FFT)-based protein–ligand docking together with parallel simulated annealing for both rigid and flexible receptor docking are implemented on graphical processing unit (GPU) accelerated platforms to significantly enhance the throughput of CDOCKER and flexible CDOCKER, the docking algorithms in the CHARMM program for biomolecule modeling. The FFT-based approach for docking, first applied in protein–protein docking to efficiently search for the binding position and orientation of proteins, is adapted here to search ligand translational and rotational spaces given a ligand conformation in protein–ligand docking. Running on GPUs, our implementation of FFT docking in CDOCKER achieves a 15,000-fold speedup in the ligand translational and rotational space search in protein–ligand docking problems. With this significant speedup it becomes practical to exhaustively search ligand translational and rotational space when docking a rigid ligand into a protein receptor. We demonstrate in this paper that this provides an efficient way to calculate an upper bound for docking accuracy in the assessment of scoring functions for protein–ligand docking, which can be useful for improving scoring functions. The parallel molecular dynamics (MD) simulated annealing, also running on GPUs, aims to accelerate the search algorithm in CDOCKER by running MD simulated annealing in parallel on GPUs. When utilized as part of the general CDOCKER docking protocol, acceleration in excess of 20 times is achieved. With this acceleration, we demonstrate that the performance of CDOCKER for redocking is significantly improved compared with three other popular protein–ligand docking programs on two widely used protein–ligand complex data sets: the Astex diverse set and the SB2012 test set. The flexible CDOCKER is similarly improved by the parallel MD simulated annealing on GPUs. Based on the results presented here, we suggest that the accelerated CDOCKER platform provides a highly competitive docking engine for both rigid-receptor and flexible-receptor docking studies and will further facilitate continued improvement in the physics-based scoring function employed in CDOCKER docking studies.
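The idea behind FFT-based translational search can be illustrated in one dimension: the score of every ligand shift on a periodic grid is a cross-correlation, which the correlation theorem turns into a product of Fourier transforms. The sketch below is a toy under stated assumptions (a 1D periodic grid, made-up grid values, a hand-written radix-2 FFT). It is not the CDOCKER or CHARMM implementation, which works on 3D grids and also samples ligand rotations.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    if n % 2:
        raise ValueError("length must be a power of two")
    sign = 1 if inverse else -1
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def scores_fft(receptor, ligand):
    """score[t] = sum_x receptor[(x + t) % n] * ligand[x] for all shifts t, via FFT."""
    n = len(receptor)
    fr = fft(receptor)
    fl = fft(ligand)
    product = [r * l.conjugate() for r, l in zip(fr, fl)]
    # The inverse transform above is unnormalized, so divide by n here.
    return [v.real / n for v in fft(product, inverse=True)]


def scores_direct(receptor, ligand):
    """Same scores by explicit summation, for checking the FFT route."""
    n = len(receptor)
    return [
        sum(receptor[(x + t) % n] * ligand[x] for x in range(n))
        for t in range(n)
    ]


# Hypothetical grids: a receptor energy well and a three-point "ligand".
receptor_grid = [0.0, 0.0, -1.0, -3.0, -1.0, 0.0, 0.0, 0.0]
ligand_grid = [1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shift_scores = scores_fft(receptor_grid, ligand_grid)
best_shift = min(range(len(shift_scores)), key=lambda t: shift_scores[t])
```

With these made-up grids the lowest score is at `best_shift == 2`, where the ligand sits centered on the well. The direct sum costs O(n²) per orientation while the FFT route costs O(n log n), and that gap widens on 3D grids.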
The three-dimensional organization of chromatin is expected to play critical roles in regulating genome functions. High-resolution characterization of its structure and dynamics could improve our understanding of gene regulation mechanisms but has remained challenging. Using a near-atomistic model that preserves the chemical specificity of protein-DNA interactions at residue and base-pair resolution, we studied the stability and folding pathways of a tetra-nucleosome. Dynamical simulations performed with an advanced sampling technique uncovered multiple pathways that connect open chromatin configurations with the zigzag crystal structure. Intermediate states along the simulated folding pathways resemble chromatin configurations reported from in situ experiments. We further determined a six-dimensional free energy surface as a function of the inter-nucleosome distances via a deep learning approach. The zigzag structure can indeed be seen as the global minimum of the surface. However, it is not favored by a significant amount relative to the partially unfolded, in situ configurations. Chemical perturbations such as histone H4 tail acetylation and thermal fluctuations can further tilt the energetic balance to stabilize intermediate states. Our study provides insight into the connection between various reported chromatin configurations and has implications for the in situ relevance of the 30 nm fiber.
λ-dynamics is a generalized ensemble method for alchemical free energy calculations. In traditional λ-dynamics, the alchemical switch variable λ is treated as a continuous variable ranging from 0 to 1 and an empirical estimator is utilized to approximate the free energy. In the present paper, we describe an alternative formulation of λ-dynamics that utilizes the Gibbs sampler framework, which we call Gibbs Sampler λ-dynamics (GSLD). GSLD, like traditional λ-dynamics, can be readily extended to calculate free energy differences between multiple ligands in one simulation. We also introduce a new free energy estimator, the Rao-Blackwell estimator (RBE), for use in conjunction with GSLD. Compared with the current empirical estimator, the advantage of RBE is that it is an unbiased estimator and its variance is usually smaller than that of the current empirical estimator. We also show that the multistate Bennett acceptance ratio (MBAR) equation or the unbinned weighted histogram analysis method (UWHAM) equation can be derived using the RBE. We illustrate the use and performance of this new free energy computational framework by application to a simple harmonic system as well as relevant calculations of small molecule relative free energies of solvation and binding to a protein receptor. Our findings demonstrate consistent and improved performance compared with conventional alchemical free energy methods.
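A toy version of the Gibbs sampling idea can be written for a harmonic system in reduced units (kT = 1): alternate between drawing a coordinate given the current discrete λ state and drawing a λ state given the coordinate. The Rao-Blackwell estimate then averages the conditional state probabilities, whereas the empirical estimate counts visits. This is a sketch of the general technique under stated assumptions, not the paper's implementation: exact Gaussian draws stand in for molecular dynamics, there are no bias potentials, and all names are hypothetical.

```python
import math
import random


def conditional_state_probs(x, spring_constants):
    """p(state k | x), proportional to exp(-0.5 * k_k * x^2)."""
    log_w = [-0.5 * k * x * x for k in spring_constants]
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def gibbs_lambda_free_energies(spring_constants, n_steps=20000, seed=0):
    """Return (rao_blackwell, empirical) free energies relative to state 0."""
    rng = random.Random(seed)
    n_states = len(spring_constants)
    state = 0
    prob_sums = [0.0] * n_states  # accumulates p(k | x_n) for the Rao-Blackwell estimate
    visits = [0] * n_states       # accumulates state counts for the empirical estimate
    for _ in range(n_steps):
        # Step 1: sample x given the state (exact draw, standing in for MD).
        x = rng.gauss(0.0, 1.0 / math.sqrt(spring_constants[state]))
        # Step 2: sample the state given x.
        probs = conditional_state_probs(x, spring_constants)
        for k, p in enumerate(probs):
            prob_sums[k] += p
        state = rng.choices(range(n_states), weights=probs)[0]
        visits[state] += 1
    rao_blackwell = [-math.log(s / prob_sums[0]) for s in prob_sums]
    empirical = [
        -math.log(v / visits[0]) if v > 0 and visits[0] > 0 else float("inf")
        for v in visits
    ]
    return rao_blackwell, empirical
```

For spring constants `[1.0, 4.0]` the analytic answer is 0.5 ln 4 ≈ 0.693, and both estimates should approach it. The Rao-Blackwell version uses every conditional probability rather than only the sampled state, which is the usual reason its variance is lower.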
Cryo-EM structures illustrate a novel mechanism for Vps4-mediated disassembly of ESCRT-III filaments.
Coarse-grained models have proven helpful for simulating complex systems over long time scales to provide molecular insights into various processes. Methodologies for systematic parametrization of the underlying energy function or force field that describes the interactions among different components of the system are of great interest for ensuring simulation accuracy. We present a new method, potential contrasting, to enable efficient learning of force fields that can accurately reproduce the conformational distribution produced with all-atom simulations. Potential contrasting generalizes the noise contrastive estimation method with umbrella sampling to better learn the complex energy landscape of molecular systems. When applied to the Trp-cage protein, we found that the technique produces force fields that thoroughly capture the thermodynamics of the folding process despite the use of only α-carbons in the coarse-grained model. We further showed that potential contrasting could be applied over large data sets that combine the conformational ensembles of many proteins to improve force field transferability. We anticipate that potential contrasting will be a powerful tool for building general-purpose coarse-grained force fields.
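The noise contrastive estimation step that potential contrasting builds on can be sketched for a one-parameter energy function. The example below is plain noise contrastive estimation only: it omits the umbrella-sampling generalization and is not the authors' code. The harmonic "force field" u(x) = 0.5 θ x², the standard normal noise distribution, and the Newton optimizer are all assumptions chosen to keep the example small.

```python
import math
import random


def sigmoid(h):
    """Numerically stable logistic function."""
    if h >= 0:
        return 1.0 / (1.0 + math.exp(-h))
    e = math.exp(h)
    return e / (1.0 + e)


def noise_log_density(x):
    """Log density of the known noise distribution, a standard normal."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def fit_nce(data, noise, n_newton=25):
    """Fit theta and the log normalizer log_z of the model
    log p(x) = -0.5 * theta * x^2 - log_z
    by classifying data against noise (equal sample counts assumed)."""
    theta, log_z = 1.0, 0.5 * math.log(2.0 * math.pi)  # start at the noise model
    for _ in range(n_newton):
        g_t = g_c = a_tt = a_tc = a_cc = 0.0
        for samples, label in ((data, 1.0), (noise, 0.0)):
            for x in samples:
                # Logit: model log density minus noise log density.
                h = (-0.5 * theta * x * x - log_z) - noise_log_density(x)
                p = sigmoid(h)
                d_t, d_c = -0.5 * x * x, -1.0  # dh/dtheta, dh/dlog_z
                r, w = label - p, p * (1.0 - p)
                g_t += r * d_t
                g_c += r * d_c
                a_tt += w * d_t * d_t
                a_tc += w * d_t * d_c
                a_cc += w * d_c * d_c
        # Newton step: solve the 2x2 system A * delta = g.
        det = a_tt * a_cc - a_tc * a_tc
        theta += (a_cc * g_t - a_tc * g_c) / det
        log_z += (a_tt * g_c - a_tc * g_t) / det
    return theta, log_z


def demo_nce(true_theta=4.0, n_samples=4000, seed=0):
    """Hypothetical data: samples from exp(-0.5 * true_theta * x^2)."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0 / math.sqrt(true_theta)) for _ in range(n_samples)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return fit_nce(data, noise)
```

Running `demo_nce()` should recover a stiffness near the true value of 4 and a log normalizer near 0.5 ln(2π/4) ≈ 0.226. The log normalizer is learned as an ordinary parameter, which is what lets the method work with unnormalized energy functions.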