Directed evolution is an important research activity in synthetic biology and biotechnology. Numerous reports describe the application of tedious mutation/screening cycles for the improvement of proteins. Recently, knowledge-based approaches have facilitated the prediction of protein properties and the identification of improved mutants. However, epistatic phenomena constitute an obstacle that can impair predictions in protein engineering. We present an innovative sequence-activity relationship (innov’SAR) methodology based on digital signal processing, combining wet-lab experimentation and computational protein design. In our machine learning approach, a predictive model is developed to find the resulting property of the protein when n single point mutations are combined (2^n combinations). The originality of our approach is that only sequence information and the fitness of mutants measured in the wet lab are needed to build models. We illustrate the approach by improving the enantioselectivity of an epoxide hydrolase from Aspergillus niger. Nine single point mutants (n = 9) of the enzyme were experimentally assessed for their enantioselectivity and used as a learning dataset to build a model. The enantioselectivity of the 512 variants (2^9) obtained by combining the 9 single point mutations was predicted, and candidates were experimentally checked: mutants with higher enantioselectivity were indeed found.
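As a rough illustration of the size of the combinatorial library described above, the following Python sketch enumerates every combination of n single point mutations. The mutation labels are placeholders invented for this example, not the mutations studied in the paper, and the function name is hypothetical.

```python
from itertools import product

def enumerate_variants(mutations):
    """Yield every combination of the given single point mutations.

    Each variant is a tuple of the mutations it carries; the empty
    tuple stands for the unmutated (wild-type) sequence.
    """
    for mask in product((0, 1), repeat=len(mutations)):
        yield tuple(m for m, present in zip(mutations, mask) if present)

# Placeholder labels (assumption): the abstract does not list the mutations.
mutations = [f"m{i}" for i in range(1, 10)]
variants = list(enumerate_variants(mutations))
print(len(variants))  # 2**9 = 512
```

Only the n single mutants need to be measured to train a model of the kind the abstract describes; the remaining combinations are the ones it would be asked to predict.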
Protein structures are classically described in terms of secondary structures. Even though the regular secondary structures have relevant physical meaning, their recognition from atomic coordinates has some important limitations, such as uncertainties in the assignment of the boundaries of helical and β-strand regions. Further, on average about 50% of all residues are assigned to an irregular state, i.e., the coil. Different research teams have therefore focused on abstracting the conformation of the protein backbone in short, localized stretches. Using different geometric measures, local stretches in protein structures are clustered into a chosen number of states. A prototype representative of the local structures in each cluster is generally defined. These libraries of local structure prototypes are called “structural alphabets”. We have developed a structural alphabet, named Protein Blocks, not only to approximate protein structures, but also to predict them from sequence. Since its development, we and other teams have explored numerous new research fields using this structural alphabet. We review here some of the most interesting applications.
Encoding protein 3D structures into a 1D string using short structural prototypes, or structural alphabets, opens a new front for structure comparison and analysis. Using the well-documented 16 motifs of Protein Blocks (PBs) as a structural alphabet, we have developed a methodology to compare protein structures encoded as sequences of PBs by aligning them with dynamic programming and a substitution matrix for PBs. This methodology is implemented in the applications available in the Protein Block Expert (PBE) server. PBE addresses common issues in the field of protein structure analysis, such as the comparison of protein structures and the identification of protein structures in structural databanks that resemble a given structure. PBE-T provides a facility to transform any PDB file into sequences of PBs. PBE-ALIGNc compares two protein structures based on the alignment of their corresponding PB sequences. PBE-ALIGNm is a facility for mining the SCOP database for similar structures based on the alignment of PBs. In addition, PBE provides an interface to a database (PBE-SAdb) of preprocessed PB sequences from SCOP culled at 95% and of all-against-all pairwise PB alignments at family and superfamily levels. PBE server is freely available at .
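The alignment idea above can be sketched as a dynamic programming score over two PB strings. This is a minimal illustration, not the PBE implementation: the global (Needleman-Wunsch style) formulation, the toy match/mismatch scores standing in for the real PB substitution matrix, and the gap penalty are all assumptions made for the example.

```python
PB_LETTERS = "abcdefghijklmnop"  # the 16 Protein Blocks, labelled a to p

def toy_substitution(x, y):
    """Placeholder scores (assumption); PBE uses a PB-specific matrix."""
    return 2 if x == y else -1

def global_alignment_score(s, t, sub=toy_substitution, gap=-2):
    """Best global alignment score of PB strings s and t with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of s with t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            cur.append(max(
                prev[j - 1] + sub(s[i - 1], t[j - 1]),  # align two PBs
                prev[j] + gap,                          # gap in t
                cur[j - 1] + gap,                       # gap in s
            ))
        prev = cur
    return prev[-1]

print(global_alignment_score("mmmm", "mmm"))  # 3 matches and 1 gap: 4
```

Swapping `toy_substitution` for a lookup into a real 16 x 16 PB substitution matrix would leave the recurrence unchanged.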
Background: Disulphide bridges are well known to play key roles in the stability, folding and function of proteins. Introduction or deletion of disulphides by site-directed mutagenesis has produced varying effects on stability and folding, depending upon the protein and the location of the disulphide in the 3-D structure. Given the lack of complete understanding, it is worthwhile to learn from an analysis of the extent of conservation of disulphides in homologous proteins. We have also addressed the question of which structural interactions in one homologue replace a disulphide present in another homologue.
Background: Connecting the dots between the protein sequence and its function is of fundamental interest for protein engineers. In-silico methods are useful in this quest, especially when structural information is not available. In this study we propose a mutant library screening tool called iSAR (innovative Sequence Activity Relationship) that relies on the physicochemical properties of the amino acids, digital signal processing and partial least squares regression to uncover these sequence-function correlations. Results: We show that the digitized representation of the protein sequence in the form of a Fourier spectrum can be used as an efficient descriptor to model the sequence-activity relationship of proteins. The iSAR methodology that we have developed identifies high-fitness mutants from mutant libraries. iSAR correlates variations caused by mutations in spectra with biological activity/fitness. It takes into account the impact of mutations on the whole spectrum and does not focus on local fitness alone. The utility of the method is illustrated on 4 datasets: cytochrome P450 for thermostability, TNF-alpha for binding affinity, GLP-2 for potency and enterotoxins for thermostability. The datasets were chosen to illustrate the ability of the method to perform when limited training data is available and when the test set contains novel mutations that do not appear in the training set. Conclusion: The combination of Fast Fourier Transform and Partial Least Squares regression is efficient in capturing the effects of mutations on the function of the protein. iSAR is a fast algorithm which can be implemented with limited computational resources and can make effective predictions even if the training set is limited in size. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2407-8) contains supplementary material, which is available to authorized users.
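The descriptor step described above (numerical encoding of the sequence followed by a Fourier transform) can be sketched as follows. This is an illustration under stated assumptions, not the iSAR code: the Kyte-Doolittle hydropathy scale is chosen here only as one example of a physicochemical index, a direct discrete Fourier transform stands in for the FFT (it yields the same spectrum), the toy sequences are invented, and the partial least squares regression step is omitted.

```python
import cmath

# Kyte-Doolittle hydropathy values (assumption: the abstract does not
# say which physicochemical index iSAR uses).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def encode(sequence):
    """Turn an amino acid string into a numerical signal."""
    return [HYDROPATHY[residue] for residue in sequence]

def spectrum(signal):
    """Magnitude of each discrete Fourier coefficient of the signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]

wild_type = spectrum(encode("AAAA"))  # toy sequence
mutant = spectrum(encode("AAIA"))     # single point mutation A3I
changed = [abs(w - m) > 1e-9 for w, m in zip(wild_type, mutant)]
print(changed)  # every frequency component shifts
```

In this toy case a single substitution alters every component of the spectrum, which is consistent with the abstract's remark that mutations are assessed through their effect on the whole spectrum. In a full pipeline, spectra like these would be the input features for a regression model fitted to measured fitness values.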
Engineering proteins and enzymes with the desired functionality has broad applications in molecular biology, biotechnology, biomedical sciences, health, and medicine. The vastness of protein sequence space and all the possible proteins it represents can pose a considerable barrier for enzyme engineering campaigns through directed evolution and rational design. The nonlinear effects of coevolution between amino acids in protein sequences complicate this further. Data-driven models increasingly provide scientists with the computational tools to navigate through the largely undiscovered forest of protein variants and catch a glimpse of the rules and effects underlying the topology of sequence space. In this review, we outline a complete theoretical journey through the processes of protein engineering methods such as directed evolution and rational design and reflect on these strategies and data-driven hybrid strategies in the context of sequence space. We discuss crucial phenomena of residue coevolution, such as epistasis, and review the history of models created over the past decade, aiming to infer rules of protein evolution from data and use this knowledge to improve the prediction of the structure-function relationship of proteins. Data-driven models based on deep learning algorithms are among the most promising methods that can account for the nonlinear phenomena of sequence space to some degree. We also critically discuss the available models to predict evolutionary coupling and epistatic effects (classical and deep learning) in terms of their capabilities and limitations. Finally, we present our perspective on possible future directions for developing data-driven approaches and provide key orientation points and necessities for the future of the fast-evolving field of enzyme engineering.