Data-driven strategies are gaining increased attention in protein engineering due to recent advances in access to large experimental databanks of proteins, next-generation sequencing (NGS), high-throughput screening (HTS) methods, and the development of artificial intelligence algorithms. However, the reliable prediction of beneficial amino acid substitutions, their combination, and their effect on functional properties remain the most significant challenges in protein engineering, which is applied to develop proteins and enzymes for biocatalysis, biomedicine, and life sciences. Here, we present a general-purpose framework (PyPEF: pythonic protein engineering framework) for performing data-driven protein engineering using machine learning methods combined with techniques from signal processing and statistical physics. PyPEF guides the identification and selection of beneficial proteins of a defined sequence space by systematically or randomly exploring the fitness of variants and by sampling random evolution pathways. The performance of PyPEF was evaluated with respect to its predictive accuracy and throughput on four public protein and enzyme data sets using common regression models. The program efficiently predicted the fitness of protein sequences for different target properties (predictive models with coefficient of determination values ranging from 0.58 to 0.92). By combining machine learning and protein evolution, PyPEF enabled the screening of proteins with various functions, reaching a screening capacity of more than 500,000 protein sequence variants within only a few minutes on a personal computer.
PyPEF displayed significant accuracies on four public data sets (different proteins and properties) and underlined the potential of integrating data-driven technologies for covering different philosophies by either predicting the fitness of the variants to the highest accuracy accounting for epistatic effects or capturing the general trend of introduced mutations on the fitness in directed protein evolution campaigns. In essence, PyPEF can provide a powerful solution to current sequence exploration and combinatorial problems faced in protein engineering through exhaustive in silico screening of the sequence space.
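The idea of sampling random evolution pathways under a predictive model can be illustrated with a short sketch. This is not PyPEF's implementation: the acceptance rule (a Metropolis-style criterion), the function names `toy_fitness` and `sample_pathway`, and the `temperature` parameter are all assumptions made for illustration, and the toy fitness function merely stands in for a trained regression model.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(seq, target="MKTAYIAK"):
    """Hypothetical stand-in for a trained regression model.

    Returns the fraction of positions that match an arbitrary target sequence.
    """
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def sample_pathway(start, predict, steps=200, temperature=0.01, rng=None):
    """Sample one random evolution pathway (assumed Metropolis-style walk).

    Each step proposes a single random substitution. Proposals that do not
    lower the predicted fitness are always accepted; worse proposals are
    accepted with probability exp(delta / temperature).
    """
    rng = rng or random.Random(0)
    seq = list(start)
    current = predict("".join(seq))
    path = [("".join(seq), current)]
    for _ in range(steps):
        pos = rng.randrange(len(seq))
        new_aa = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
        trial = seq.copy()
        trial[pos] = new_aa
        proposed = predict("".join(trial))
        delta = proposed - current
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            seq, current = trial, proposed
            path.append(("".join(seq), current))
    return path


if __name__ == "__main__":
    pathway = sample_pathway("AAAAAAAA", toy_fitness)
    for variant, fitness in pathway[-3:]:
        print(variant, round(fitness, 3))
```

In practice `predict` would wrap a regression model trained on measured sequence-fitness pairs; a low `temperature` makes the walk close to greedy, while a higher value lets it cross fitness valleys.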
Protein engineering through directed evolution and (semi)rational approaches has been applied successfully to optimize protein properties for broad applications in molecular biology, biotechnology, and biomedicine. The potential of protein engineering is not yet fully realized due to the limited screening throughput hampering the efficient exploration of the vast protein sequence space. Data-driven strategies have emerged as a powerful tool to leverage protein engineering by providing a model of the sequence-fitness landscape that can exhaustively be explored in silico and capitalize on the high diversity potential offered by nature. However, as both the quality and quantity of the input data determine the success of such approaches, the applicability of data-driven strategies is often limited due to sparse data. Here, we present a hybrid model that combines direct coupling analysis and machine learning techniques to enable data-driven protein engineering when only a few labeled sequences are available. Our method achieves high performance in predicting a protein's fitness based on its sequence regardless of the number of sequence-fitness pairs in the training dataset. Besides reducing the computational effort compared to state-of-the-art methods, it outperforms them for sparse data situations, i.e., 50-250 labeled sequences available for training. In essence, the developed method is promising for data-driven protein engineering, especially for protein engineers who only have access to a limited amount of data for sequence-fitness landscape modeling.
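One way to picture a hybrid of an unsupervised score (such as a statistical energy from direct coupling analysis) and a supervised prediction is a weighted sum whose weight is tuned on the few labeled variants. The sketch below is an assumption for illustration only, not the authors' method: the grid search, the use of Spearman rank correlation as the criterion, and every function name are hypothetical, and the input scores are taken as already computed.

```python
import math


def ranks(values):
    """Return 1-based ranks, assigning the average rank to tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def fit_hybrid_weight(unsupervised, supervised, fitness, grid=21):
    """Grid-search w in [0, 1] for the score w * unsupervised + (1 - w) * supervised.

    Returns the first weight that maximizes the Spearman correlation with the
    measured fitness, together with that correlation.
    """
    best_w, best_rho = 0.0, -2.0
    for k in range(grid):
        w = k / (grid - 1)
        combined = [w * u + (1 - w) * s for u, s in zip(unsupervised, supervised)]
        rho = spearman(combined, fitness)
        if rho > best_rho:
            best_w, best_rho = w, rho
    return best_w, best_rho
```

With only tens of labeled sequences, a single tunable weight keeps the number of fitted parameters small, which is the general motivation for leaning on an unsupervised score in sparse-data settings; the weight should be chosen on data held out from the supervised model's training set.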
Protein engineering campaigns are driven by the demand for superior enzyme performance under non-natural process conditions, such as elevated temperature or non-neutral pH, to achieve utmost efficiency and conserve limited resources. Phytases are industrially relevant feed enzymes that contribute to the overall phosphorus (P) management by catalyzing the stepwise phosphate hydrolysis from phytate, which is the main phosphorus storage in plants. Phosphorus is referred to as a critical disappearing nutrient, emphasizing the urgent need to implement strategies for a sustainable circular use and recovery of P from renewable resources. Engineered phytases already contribute today to an efficient phosphorus mobilization in the feeding industry and might pave the way to a circular P-bioeconomy. To date, a bottleneck in their application is the drastically reduced hydrolysis on lower phosphorylated reaction intermediates (lower inositol phosphates, ≤InsP4) and their subsequent accumulation. Here, we report the first KnowVolution campaign of the E. coli phytase toward improved hydrolysis on InsP4 and InsP3. As a prerequisite prior to evolution, a suitable screening setup was established and three isomers, Ins(2,4,5)P3, Ins(2,3,4,5)P4, and Ins(1,2,5,6)P4, were generated through enzymatic hydrolysis of InsP6 and subsequent purification by HPLC. Screening of epPCR libraries identified clones with improved hydrolysis on Ins(1,2,5,6)P4 carrying substitutions involved in substrate binding and orientation. Saturation of seven positions and screening of, in total, 10,000 clones generated a dataset of 46 variants characterized for their activity on all three isomers. This dataset was used for training, testing, and inferring models for machine learning guided recombination. The PyPEF method used allowed the prediction of recombinants from the identified substitutions, which were analyzed by reverse engineering to gain molecular understanding.
Six variants with >2.5-fold improved InsP4 hydrolysis were identified, of which variant T23L/K24S had a 3.7-fold improved relative activity on Ins(2,3,4,5)P4 and concomitantly showed a 2.7-fold improved hydrolysis of Ins(2,4,5)P3. The reported substitutions are the first published Ec phy variants with improved hydrolysis on InsP4 and InsP3.
Recently, the study of chitinases has become an important target of numerous research projects due to their potential for applications, such as biocontrol pest agents. Plant chitinases from carnivorous plants of the genus Drosera are most aggressive against a wide range of phytopathogens. However, low solubility or insolubility of the target protein has hampered the application of chitinases as biofungicides. To obtain plant chitinase from carnivorous plants of the genus Drosera in soluble form in E. coli expression strains, three different renaturation approaches were tested: dialysis, rapid dilution, and refolding on Ni-NTA agarose. The developed "Rapid dilution" protocol, with renaturation buffer supplemented by 10% glycerol and 2 M arginine in combination with the redox pair of reduced/oxidized glutathione, increased the yield of active soluble protein to 9.5 mg per 1 g of wet biomass. A structure-based removal of free cysteines in the core domain, based on homology modeling of the structure, was carried out in order to improve the solubility of chitinase. One improved chitinase variant (C191A/C231S/C286T) was identified which shows improved expression and solubility in E. coli expression systems compared to wild type. Computational analyses of the wild-type and the improved variant revealed overall higher fluctuations of the structure while maintaining a global protein stability. It was shown that free cysteines on the surface of the protein globule which are not involved in the formation of inner disulfide bonds contribute to the insolubility of chitinase from Drosera capensis. The functional characteristics showed that chitinase exhibits high activity against colloidal chitin (360 units/g) and high fungicidal properties of recombinant chitinases against Parastagonospora nodorum. The latter highlights the application of chitinase from D. capensis as a promising enzyme for the control of fungal pathogens in agriculture.