2020
DOI: 10.1021/acssynbio.0c00219
Signal Peptides Generated by Attention-Based Neural Networks

Abstract: Short (15–30 residue) chains of amino acids at the amino termini of expressed proteins, known as signal peptides (SPs), specify secretion in living cells. We trained an attention-based neural network, the Transformer model, on data from all available organisms in Swiss-Prot to generate SP sequences. Experimental testing demonstrates that the model-generated SPs are functional: when appended to enzymes expressed in an industrial Bacillus subtilis strain, the SPs lead to secreted activity that is competitive with i…
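The generation step described in the abstract amounts to autoregressive sampling of amino-acid tokens from a trained decoder. Below is a minimal sketch of such a sampling loop; the `toy_logits` function is a hypothetical stand-in for a trained Transformer's next-token scores, and the length cap and temperature are illustrative assumptions, not the paper's settings:

```python
import numpy as np

AA = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids
rng = np.random.default_rng(0)

def toy_logits(prefix: str) -> np.ndarray:
    # Stand-in for a trained Transformer decoder's next-token logits;
    # deterministic in the prefix, for illustration only.
    seed = sum(ord(c) for c in prefix) % (2**32)
    return np.random.default_rng(seed).normal(size=len(AA))

def sample_sp(max_len: int = 25, temperature: float = 1.0) -> str:
    """Autoregressively sample an SP-like sequence, starting from
    the canonical N-terminal methionine."""
    seq = "M"
    while len(seq) < max_len:
        logits = toy_logits(seq) / temperature
        p = np.exp(logits - logits.max())  # softmax over the vocabulary
        p /= p.sum()
        seq += rng.choice(AA, p=p)
    return seq

print(sample_sp())  # a 25-residue sequence beginning with "M"
```

Lowering `temperature` concentrates probability on the highest-scoring residues; raising it increases sequence diversity, which is how generative models trade novelty against fidelity to the training distribution.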

Cited by 70 publications (91 citation statements). References 39 publications (83 reference statements).
“…These improvements seem to be highly dependent on the attached protein and are therefore not per se applicable to other, unrelated proteins. Novel approaches such as machine-learning-based design of signal peptides might help to rationalise the use of SPs, but still need to be transferred to eukaryotic systems [61]. We therefore decided to build a rather diverse signal peptide panel, which can be rapidly assembled using the modular Golden Gate system and tested in a high-throughput manner in a 96-well plate setup.…”
Section: Discussion
confidence: 99%
“…To sum up the basic characteristics of the procedure: only an initial dataset containing the primary sequences of enzyme variants and the respective biological properties is required. It differs from other ML approaches in the following ways: i) thanks to the Fourier transform, the nonlinear aspects inside the protein sequence are captured; ii) FFT allows new mutations to be introduced at positions not previously explored, or at new positions of mutations; [15] iii) a single round, as in this case, allows the identification of high-performing mutants, while avoiding iv) the need for the excessively large datasets customary in other ML [6f] or deep-learning approaches; [27a,b] v) there is no need for alignment-based amino acid descriptors [27c] or for protein sequences of equal length; and vi) large computational resources and/or long computational times are not required [27b,c]. In the two examples cited as references, a graphics processing unit (GPU) is needed for a reasonable training time.…”
Section: Results
confidence: 99%
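The FFT encoding referred to above maps a variable-length protein sequence to a fixed-length spectral descriptor, which is why sequences of equal length are not required. A minimal sketch, assuming residues are first encoded with a physicochemical scale (Kyte–Doolittle hydrophobicity is an illustrative choice, not necessarily the one used in the cited work):

```python
import numpy as np

# Kyte-Doolittle hydrophobicity scale (one possible residue descriptor)
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
      "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
      "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
      "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def fft_encode(seq: str, n_points: int = 64) -> np.ndarray:
    """Map a protein sequence to a fixed-length spectral descriptor:
    encode residues numerically, FFT (zero-padded/truncated to
    n_points), and keep half the magnitude spectrum (real input
    gives a symmetric spectrum)."""
    x = np.array([KD[aa] for aa in seq], dtype=float)
    spectrum = np.abs(np.fft.fft(x, n=n_points))
    return spectrum[: n_points // 2]

# Sequences of different lengths yield descriptors of identical shape,
# so they can feed a single regression model.
wt = fft_encode("MKKRLVLALVLAFSLV")    # made-up SP-like sequence
short = fft_encode("MKKRLVLA")
print(wt.shape, short.shape)  # (32,) (32,)
```

The spectral features capture periodic patterns along the chain, which is one way nonlinear positional interactions enter a linear downstream model.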
“…Church's team defined and proposed a hit rate that makes it possible to compare machine-learning approaches, including deep learning, on a fairly objective basis: [30a] we will use this hit rate. His team compared 12 methods, to which we have added two recently published methods, that of Wu et al. [27b] and that of Xu et al. [31]…”
Section: Results
confidence: 99%
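The hit-rate metric mentioned here can be sketched as the fraction of tested designs whose measured property meets or exceeds a reference threshold (e.g., wild-type activity); the exact definition in the cited work may differ, so treat this as an assumption:

```python
def hit_rate(measured, threshold):
    """Fraction of tested designs whose measured property meets or
    exceeds a reference threshold (e.g., wild-type activity)."""
    if not measured:
        raise ValueError("need at least one measurement")
    hits = sum(1 for v in measured if v >= threshold)
    return hits / len(measured)

# Five designs assayed relative to wild-type activity = 1.0:
print(hit_rate([1.2, 0.8, 1.5, 0.9, 1.1], 1.0))  # 0.6
```

A single scalar like this lets methods tested on different proteins and assay scales be ranked on a common footing, which is why it is useful for cross-paper comparisons.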
“…Although secretory signals of 15–30 amino acids are more or less well known for systems such as Sec [192] and Tat [193] in E. coli, natural evolution seems to have performed only a rather limited and stochastic search of these quite large sequence spaces. Thus, Arnold and colleagues [194] used deep-learning methods to model known sequences and could predict novel ones that were 'highly diverse in sequence, sharing as little as 58% sequence identity with the closest known native signal peptide and 73% ± 9% on average' [194]. These kinds of findings strongly imply that, because Nature tends to use weak mutation and strong selection [8], necessarily becoming trapped in local optima, much is to be gained by a deeper exploration of novel sequence spaces.…”
Section: Optimisation
confidence: 98%