Low-N protein engineering with data-efficient deep learning

Biswas, Surojit; Khimulya, Grigory; Alley, Ethan C.; Esvelt, Kevin M.; Church, George M.

doi:10.1101/2020.01.23.917682

Cited by 97 publications

(194 citation statements)

References 79 publications

Supporting

Mentioning

190

Contrasting


“…(2-3), maximizing the joint probability of sequence and structure in Eq. (1) is equivalent to maximizing the following objective (Eq. (4)). P(structure) is a fixed distribution which depends only on the protein length and is generated only once at the beginning of simulations; $f_a^{\mathrm{PDB}}$ is fixed too; hence the optimization focuses on maximizing $D_{\mathrm{KL}}$. The design procedure starts off with picking a random amino acid sequence of a given length L (L = 100 throughout the study), passing it through trRosetta and background networks and calculating the objective F according to Eq. (4).…”

Section: Methods (mentioning)

confidence: 99%
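The quoted passage says the design objective reduces to maximizing a KL divergence between the network's predicted distributions and a background distribution. The exact form of Eq. (4) is not reproduced in the excerpt, so the following is only a minimal sketch of one plausible reading: the KL divergence is computed per residue pair over distance bins and then averaged. The function name `mean_kl_divergence`, the array shapes, and the `eps` clipping are assumptions made for illustration, not the cited authors' implementation.

```python
import numpy as np

def mean_kl_divergence(pred, background, eps=1e-9):
    """Average KL(pred || background) over all residue pairs.

    pred, background: arrays of shape (L, L, n_bins), where each [i, j, :]
    is a probability distribution over distance bins (assumed layout).
    """
    # Clip to avoid log(0); eps is an illustrative choice.
    pred = np.clip(pred, eps, 1.0)
    background = np.clip(background, eps, 1.0)
    kl_per_pair = np.sum(pred * np.log(pred / background), axis=-1)
    return float(kl_per_pair.mean())

# Toy inputs: a flat background and a sharply peaked "prediction".
L, n_bins = 5, 4
background = np.full((L, L, n_bins), 1.0 / n_bins)
peaked = np.zeros((L, L, n_bins))
peaked[..., 0] = 1.0

print(mean_kl_divergence(background, background))  # featureless map scores 0
print(mean_kl_divergence(peaked, background))      # close to log(4)
```

A featureless prediction that matches the background scores zero, while a confident, peaked prediction scores high, which is why maximizing this quantity pushes a sequence toward a well-defined predicted structure.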

“…Deep learning methods have shown considerable promise in protein engineering. Networks with architectures borrowed from language models have been trained on amino acid sequences, and been used to generate new sequences without considering protein structure explicitly [4,5]. Other methods have been developed to generate protein backbones without consideration of sequence [6], and to identify amino acid sequences which either fit well onto specified backbone structures [7-9] or are conditioned on low-dimensional fold representation [10]; models tailored to generate sequences and/or structures for specific protein families have also been developed [11-14].…”

Section: Introduction (mentioning)

confidence: 99%


De novo protein design by deep network hallucination

Anishchenko, Chidyausiku, Ovchinnikov, et al. 2020

Preprint


There has been considerable recent progress in protein structure prediction using deep neural networks to infer distance constraints from amino acid residue co-evolution [1–3]. We investigated whether the information captured by such networks is sufficiently rich to generate new folded proteins with sequences unrelated to those of the naturally occurring proteins used in training the models. We generated random amino acid sequences, and input them into the trRosetta structure prediction network to predict starting distance maps, which as expected are quite featureless. We then carried out Monte Carlo sampling in amino acid sequence space, optimizing the contrast (KL-divergence) between the distance distributions predicted by the network and the background distribution. Optimization from different random starting points resulted in a wide range of proteins with diverse sequences and all alpha, all beta sheet, and mixed alpha-beta structures. We obtained synthetic genes encoding 129 of these network hallucinated sequences, expressed and purified the proteins in E. coli, and found that 27 folded to monomeric stable structures with circular dichroism spectra consistent with the hallucinated structures. Thus deep networks trained to predict native protein structures from their sequences can be inverted to design new proteins, and such networks and methods should contribute, alongside traditional physically based models, to the de novo design of proteins with new functions.
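The Monte Carlo sampling in sequence space described in this abstract can be illustrated with a small sketch. Everything specific here is an assumption: the Metropolis acceptance rule, the fixed `temperature`, the single-residue proposal, and above all `toy_score`, a hypothetical stand-in for the network-derived KL-divergence objective (the real procedure would call a structure prediction network at this step).

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(seq):
    """Hypothetical objective: rewards hydrophobic residues at even positions
    and polar residues at odd positions. Not the trRosetta objective."""
    hits = 0
    for i, aa in enumerate(seq):
        if i % 2 == 0 and aa in "AILMFVW":
            hits += 1
        elif i % 2 == 1 and aa in "DEKRNQST":
            hits += 1
    return hits / len(seq)

def monte_carlo_design(score_fn, length=100, steps=2000, temperature=0.1, seed=0):
    """Metropolis sampling over sequences, starting from a random sequence."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    current = score_fn("".join(seq))
    best_seq, best = "".join(seq), current
    for _ in range(steps):
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)  # propose a single-residue change
        proposed = score_fn("".join(seq))
        delta = proposed - current
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = proposed
            if current > best:
                best, best_seq = current, "".join(seq)
        else:
            seq[pos] = old  # reject: restore the previous residue
    return best_seq, best

designed, score = monte_carlo_design(toy_score, length=20, steps=500)
print(designed, score)
```

Changing `seed` gives a different random starting point, which mirrors how the abstract reports diverse designs arising from different starts.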


“…In addition to discovering highly functional variants, another benefit of this approach is the opportunity to learn from the numerous suboptimal variants. Machine learning algorithms trained to predict functional activity from protein sequence can assist in elucidating the biochemical determinants of function and predict additional sequences to test (Alley et al, 2019; Bedbrook et al, 2019; Biswas et al, 2020; Wu et al, 2019; Xu et al, 2020; Yang et al, 2018). To this end, the PyronicSF linker sequences were encoded as numerical vectors using the VHSE amino acid descriptor (8 principal components score vectors derived from hydrophobic, steric, and electronic properties) (Mei et al, 2005).…”

Section: Sort-seq Assay Of a Pyruvate Biosensor Linker Library (mentioning)

confidence: 99%
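The quoted passage describes encoding linker sequences as numerical vectors with an 8-component amino acid descriptor. A minimal sketch of that kind of encoding follows. The lookup table below is filled with random placeholder numbers and does NOT contain the published VHSE values (Mei et al, 2005); a real analysis would load those instead. The names `PLACEHOLDER_TABLE` and `encode_sequence` are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_COMPONENTS = 8  # the descriptor has 8 components per residue

# PLACEHOLDER ONLY: random numbers standing in for the real descriptor table.
_rng = np.random.default_rng(0)
PLACEHOLDER_TABLE = {aa: _rng.normal(size=N_COMPONENTS) for aa in AMINO_ACIDS}

def encode_sequence(seq, table):
    """Concatenate the per-residue descriptor vectors into one feature vector."""
    return np.concatenate([table[aa] for aa in seq])

features = encode_sequence("GS", PLACEHOLDER_TABLE)
print(features.shape)  # 2 residues x 8 components = (16,)
```

The resulting fixed-length vector can be fed to any standard regression model, provided all sequences in the library have the same length.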

A sort-seq approach to the development of single fluorescent protein biosensors

Koberstein, Stewart, Mighell, et al. 2020

Preprint


The utility of single fluorescent protein biosensors (SFPBs) in biological research is offset by the difficulty in engineering these tools. SFPBs generally consist of three basic components: a circularly permuted fluorescent protein, a ligand-binding domain, and a pair of linkers connecting the two domains. In the absence of predictive methods for biosensor engineering, most designs combining these three components will fail to produce allosteric coupling between ligand binding and fluorescence emission. Methods to construct libraries of biosensor designs with variations in the site of GFP insertion and linker sequences have been developed; however, our ability to construct new variants has exceeded our ability to test them for function. Here, we address this challenge by applying a massively parallel assay termed sort-seq to the characterization of biosensor libraries. Sort-seq combines binned fluorescence-activated cell sorting, next-generation sequencing, and maximum likelihood estimation to quantify the dynamic range of many biosensor variants in parallel. We applied this method to two common biosensor optimization tasks: choice of insertion site and optimization of linker sequences. The sort-seq assay applied to a maltose-binding protein domain-insertion library not only identified previously described high-dynamic-range variants but also discovered new functional insertion sites with diverse properties. A sort-seq assay performed on a pyruvate biosensor linker library expressed in mammalian cell culture identified linker variants with substantially improved dynamic range. Machine learning models trained on the resulting data can predict dynamic range from linker sequence. This high-throughput approach will accelerate the design and optimization of SFPBs, expanding the biosensor toolbox.


“…Successfully addressing this core problem promises to transform the field, leading to better proteins for industry and medicine at a fraction of the cost. A number of ML methods have been implemented to address this [1], including Gaussian process regression [2-5], unsupervised statistical analyses [6], deep neural networks and sequence models [7-11]. However, a uniformly used set of objectives and benchmarks against which each architecture can be evaluated is currently unavailable.…”

Section: Introduction (mentioning)

confidence: 99%

The NK Landscape as a Versatile Benchmark for Machine Learning Driven Protein Engineering

Mater, Sandhu, Jackson 2020

Preprint


Machine learning (ML) has the potential to revolutionize protein engineering. However, the field currently lacks standardized and rigorous evaluation benchmarks for sequence-fitness prediction, which makes accurate evaluation of the performance of different architectures difficult. Here we propose a unifying framework for ML-driven sequence-fitness prediction. Using simulated (the NK model) and empirical sequence landscapes, we define four key performance metrics: interpolation within the training domain, extrapolation outside the training domain, robustness to sparse training data, and ability to cope with epistasis/ruggedness. We show that architectural differences between algorithms consistently affect performance against these metrics across both experimental and theoretical landscapes. Moreover, landscape ruggedness is revealed to be the greatest determinant of the accuracy of sequence-fitness prediction. We hope that this benchmarking method and the code that accompanies it will enable robust evaluation and comparison of novel architectures in this emerging field and assist in the adoption of ML for protein engineering.
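The NK model named in this abstract is a standard way to simulate fitness landscapes with tunable ruggedness. The sketch below is a generic textbook-style version, not the benchmark code that accompanies the cited preprint: N sites each contribute a random value that depends on the site's own state and on K other sites, and fitness is the mean contribution. The binary alphabet, random choice of interacting sites, uniform contributions, and the helper names are all assumptions.

```python
import itertools
import numpy as np

def make_nk_landscape(n, k, alphabet_size=2, seed=0):
    """Return a fitness function for a random NK landscape (0 <= k <= n - 1)."""
    rng = np.random.default_rng(seed)
    # Each site interacts with k other sites, chosen at random (assumption).
    neighbors = [
        rng.choice([j for j in range(n) if j != i], size=k, replace=False)
        for i in range(n)
    ]
    # One table of uniform random contributions per site, indexed by the
    # states of the site itself and of its k interaction partners.
    tables = [rng.random((alphabet_size,) * (k + 1)) for _ in range(n)]

    def fitness(genotype):
        total = 0.0
        for i in range(n):
            idx = (genotype[i],) + tuple(genotype[j] for j in neighbors[i])
            total += tables[i][idx]
        return total / n

    return fitness

def count_local_optima(fitness, n, alphabet_size=2):
    """Count genotypes that no single-site change can improve (brute force)."""
    count = 0
    for g in itertools.product(range(alphabet_size), repeat=n):
        f = fitness(g)
        is_optimum = True
        for i in range(n):
            for s in range(alphabet_size):
                if s != g[i] and fitness(g[:i] + (s,) + g[i + 1:]) > f:
                    is_optimum = False
        count += is_optimum
    return count

for k in (0, 2, 5):
    print(k, count_local_optima(make_nk_landscape(6, k), 6))
```

With K = 0 the landscape is additive and has a single peak; raising K adds epistasis and typically more local optima, which is the ruggedness the abstract identifies as the main determinant of prediction accuracy.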
