Molecular evolution based on mutagenesis is widely used in protein engineering. However, optimal proteins are often difficult to obtain due to a large sequence space. Here, we propose a novel approach that combines molecular evolution with machine learning. In this approach, we conduct two rounds of mutagenesis where an initial library of protein variants is used to train a machine-learning model to guide mutagenesis for the second-round library. This enables us to prepare a small library suited for screening experiments with high enrichment of functional proteins. We demonstrated a proof-of-concept of our approach by altering the reference green fluorescent protein (GFP) so that its fluorescence is changed into yellow. We successfully obtained a number of proteins showing yellow fluorescence, 12 of which had longer wavelengths than the reference yellow fluorescent protein (YFP). These results show the potential of our approach as a powerful method for directed evolution of fluorescent proteins.
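The two-round workflow described above can be sketched roughly as follows. This is a minimal illustration, not the method used in the study: the additive per-site model, the function names (`fit_additive_model`, `predict`, `propose_second_round`), the toy sequences, and the scores are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import defaultdict
from itertools import product


def fit_additive_model(variants):
    """Estimate a mean score for each (position, residue) pair.

    `variants` maps a sequence string to a measured score. The additive
    model is an assumed stand-in for whatever model a real study would use.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, score in variants.items():
        for pos, aa in enumerate(seq):
            sums[(pos, aa)] += score
            counts[(pos, aa)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def predict(model, seq, default=0.0):
    """Score a sequence as the sum of its per-site mean scores."""
    return sum(model.get((pos, aa), default) for pos, aa in enumerate(seq))


def propose_second_round(model, first_round, alphabet_per_site, library_size):
    """Return the top-scoring variants that were not in the first-round library."""
    candidates = ("".join(c) for c in product(*alphabet_per_site))
    unseen = [s for s in candidates if s not in first_round]
    # Sort by descending predicted score, breaking ties alphabetically.
    unseen.sort(key=lambda s: (-predict(model, s), s))
    return unseen[:library_size]


# Hypothetical first-round library: three mutated sites, two residues each.
first_round = {"AGT": 0.1, "AGS": 0.4, "YGT": 0.9, "AHT": 0.2, "YHS": 1.5}
sites = ["AY", "GH", "TS"]

model = fit_additive_model(first_round)
second_round = propose_second_round(model, first_round, sites, library_size=2)
print(second_round)
```

The point of the sketch is the shape of the loop: measured first-round variants train a model, and the model then narrows a large candidate space to a small second-round library for screening.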
To develop a Trichoderma reesei strain appropriate for the saccharification of pretreated cellulosic biomass, a recombinant T. reesei strain, X3AB1, was constructed that expressed an Aspergillus aculeatus β-glucosidase 1 with high specific activity under the control of the xyn3 promoter. The culture supernatant from T. reesei X3AB1 grown on 1% Avicel as a carbon source had 63- and 25-fold higher β-glucosidase activity against cellobiose compared to that of the parent strain PC-3-7 and that of the T. reesei recombinant strain expressing an endogenous β-glucosidase I, respectively. Further, the xylanase activity was 30% lower than that of PC-3-7 due to the absence of xyn3. X3AB1 grown on 1% Avicel-0.5% xylan medium produced 2.3- and 3.3-fold more xylanase and β-xylosidase, respectively, than X3AB1 grown on 1% Avicel. The supernatant from X3AB1 grown on Avicel and xylan saccharified NaOH-pretreated rice straw efficiently at a low enzyme dose, indicating that the strain has good potential for use in cellulosic biomass conversion processes.
The genes encoding the catalytic domains (CD) of the three endoglucanases (EG I, Cel7B; EG II, Cel5A; and EG III, Cel12A) from Trichoderma reesei QM9414 were expressed in Escherichia coli strains Rosetta-gami B (DE3) pLacI or Origami B (DE3) pLacI and were found to produce functional intracellular proteins. Protein production by the three endoglucanase transformants was evaluated as a function of growth temperature. Maximal productivity of EG I-CD at 15 °C, EG II-CD at 20 °C, and EG III at 37 °C resulted in yields of 6.9, 72, and 50 mg/l, respectively. The endoglucanases were purified using a simple purification method based on removing E. coli proteins by isoelectric point precipitation. Specific activity toward carboxymethyl cellulose was found to be 65, 49, and 15 U/mg for EG I-CD, EG II-CD, and EG III, respectively. EG II-CD was able to cleave 1,3-1,4-β-D-glucan and soluble cellulose derivatives. EG III was found to be active against cellulose, 1,3-1,4-β-D-glucan, and xyloglucan, while EG I-CD was active against cellulose, 1,3-1,4-β-D-glucan, xyloglucan, xylan, and mannan.
The stability and specific activity of endo-β-1,4-glucanase III from Trichoderma reesei QM9414 were enhanced, and the expression efficiency of its encoding gene, egl3, was optimized by directed evolution using error-prone PCR and activity screening in Escherichia coli RosettaBlue (DE3) pLacI as a host. A relationship between the increase in yield of active enzyme in the clones and the improvement in enzyme stability was observed among the mutants obtained in the present study. The clone harboring the best mutant, 2R4 (G41E/T110P/K173M/Y195F/P201S/N218I), selected via second-round mutagenesis after optimal recombination of first-round mutations, produced a 130-fold higher amount of mutant enzyme than the transformant with wild-type EG III. Mutant 2R4 produced by the clone showed broad pH stability (4.4-8.8) and thermotolerance (fully active at 55 °C for 30 min) compared with wild-type EG III (pH stability, 4.4-5.2; thermostability, inactive at 55 °C for 30 min). The kcat of 2R4 against carboxymethyl cellulose was about 1.4-fold higher than that of the wild type, although the Km was twice that of the wild type.
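Taking the two reported fold changes at face value, their combined effect on catalytic efficiency can be worked out directly. The ratio below is derived here from the approximate figures in the abstract, not a value the abstract itself reports:

```latex
\frac{\left(k_{\mathrm{cat}}/K_{\mathrm{m}}\right)_{\mathrm{2R4}}}
     {\left(k_{\mathrm{cat}}/K_{\mathrm{m}}\right)_{\mathrm{WT}}}
\approx \frac{1.4}{2} = 0.7
```

On these numbers, the catalytic efficiency of 2R4 toward carboxymethyl cellulose would be roughly 0.7 times that of the wild type, so the gain reported for this mutant lies mainly in yield and stability rather than in efficiency.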
A grating substrate covered with a metal layer (a plasmonic chip) and a bispecific antibody can play key roles in the sensitive detection of a marker protein with an immunosensor, because they provide an enhanced fluorescence signal and a sensor surface densely modified with capture antibody, respectively. In this study, one of the tumor markers, a soluble epidermal growth factor receptor (sEGFR), was selected as the target to be detected. The ZnO- and silver-coated plasmonic chip with precise regularity and the appropriate duty ratio in the periodic structure further enhanced the fluorescence intensity. For sensor surface modification, a bispecific antibody (anti-sEGFR and anti-ZnO) was used as the capture antibody; the concentrated bispecific antibody solution was found to nonlinearly form a surface densely immobilized with antibody, because binding of the bispecific antibody to the ZnO surface can compete with adsorption of phosphate. As a result, the interface on the plasmonic chip provided a 300× enhanced fluorescence signal compared with that on a ZnO-coated glass slide, and sEGFR could therefore be detected quantitatively over a wide concentration range, from 10 nM to 700 fM, on our plasmonic surface.
Machine learning (ML) is becoming an attractive tool in mutagenesis-based protein engineering because of its ability to design a variant library containing proteins with a desired function. However, it remains unclear how ML guides directed evolution in sequence space depending on the composition of training data. Here, we present an ML-guided directed evolution study of an enzyme to investigate the effects of a known “highly positive” variant (i.e., a variant known to have high enzyme activity) in training data. We performed two separate series of ML-guided directed evolution of Sortase A with and without a known highly positive variant called 5M in training data. In each series, two rounds of ML were conducted: variants predicted by the initial round were experimentally evaluated and used as additional training data for the second round of prediction. The improvements in enzyme activity were comparable between the two series, both achieving enzyme activity 2.2–2.5 times higher than 5M. Intriguingly, the sequences of the improved variants were largely different between the two series, indicating that ML guided the directed evolution to distinct regions of sequence space depending on the presence/absence of the highly positive variant in the training data. This suggests that the sequence diversity of improved variants can be expanded not only by conventional ML using the whole training data but also by ML using a subset of the training data even when it lacks highly positive variants. In summary, this study demonstrates the importance of regulating the composition of training data in ML-guided directed evolution.
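The comparison between the two series can be illustrated with a small sketch. Everything here is an assumption made for illustration: the per-site mean-score model, the `propose` helper, the three-residue toy sequences, and the activities are hypothetical and unrelated to the actual Sortase A data or the model used in the study. The sketch trains once on the full data and once on a subset lacking a designated highly positive variant, then measures how much the two proposed libraries overlap.

```python
from collections import defaultdict
from itertools import product


def propose(training, sites, n):
    """Rank unseen variants with a per-site mean-score model (illustrative only)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for seq, activity in training.items():
        for key in enumerate(seq):  # key is a (position, residue) pair
            sums[key] += activity
            counts[key] += 1
    mean = {key: sums[key] / counts[key] for key in sums}
    unseen = ["".join(c) for c in product(*sites) if "".join(c) not in training]
    # Sort by descending predicted score, breaking ties alphabetically.
    unseen.sort(key=lambda s: (-sum(mean.get(k, 0.0) for k in enumerate(s)), s))
    return set(unseen[:n])


# Hypothetical training data; "YHS" plays the role of the highly positive variant.
training = {"AGT": 0.1, "AGS": 0.4, "AHT": 0.3, "YGT": 0.2, "YHS": 2.0}
known_positive = "YHS"
sites = ["AY", "GH", "TS"]

# Series 1: train on everything, including the highly positive variant.
with_positive = propose(training, sites, n=2)

# Series 2: train on the subset that lacks the highly positive variant.
subset = {s: a for s, a in training.items() if s != known_positive}
without_positive = propose(subset, sites, n=2)

# Jaccard overlap between the two proposed libraries (1.0 means identical).
overlap = len(with_positive & without_positive) / len(with_positive | without_positive)
print(sorted(with_positive), sorted(without_positive), overlap)
```

A low overlap in such a comparison corresponds to the two series exploring different regions of sequence space, which is the effect the abstract attributes to the composition of the training data.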