2018
DOI: 10.1101/345843
Preprint

Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)

Abstract: In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE can be inferred over a large set of protein seque…
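As the abstract notes, PPE is inspired by byte-pair encoding. A minimal sketch of the greedy BPE merge step it builds on (without the paper's sampling framework), on hypothetical toy sequences, might look like the following; this is an illustration of the general idea, not the authors' implementation:

```python
from collections import Counter

def pair_counts(seqs):
    """Count adjacent token pairs across all segmented sequences."""
    counts = Counter()
    for toks in seqs:
        for a, b in zip(toks, toks[1:]):
            counts[(a, b)] += 1
    return counts

def merge_pair(seqs, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = []
    for toks in seqs:
        out, i = [], 0
        while i < len(toks):
            if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
                out.append(toks[i] + toks[i + 1])
                i += 2
            else:
                out.append(toks[i])
                i += 1
        merged.append(out)
    return merged

def learn_merges(sequences, n_merges):
    """Greedy BPE-style training: repeatedly merge the most frequent pair."""
    seqs = [list(s) for s in sequences]  # start from single amino acids
    merges = []
    for _ in range(n_merges):
        counts = pair_counts(seqs)
        if not counts:
            break
        best = counts.most_common(1)[0][0]  # most frequent adjacent pair
        merges.append(best)
        seqs = merge_pair(seqs, best)
    return merges, seqs

# Toy corpus of hypothetical protein fragments
merges, segmented = learn_merges(["MKTAYIAK", "MKTAYLAK", "MKTQYIAK"], 3)
print(merges)
print(segmented)
```

The paper's contribution is to replace the single deterministic merge sequence above with a sampling framework, so a sequence can be segmented in multiple ways.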


Cited by 10 publications (10 citation statements)
References 55 publications (59 reference statements)
“…High-scoring segment pair (HSP) has been used in previous methods for PPI prediction [27]. One-hot vectors [51,52] and amino acid embedding [5,6,19] have also been empirically explored to represent protein sequences.…”
Section: Introduction
Mentioning confidence: 99%
“…ProNA2020 (3) predicts whether or not a protein interacts with other proteins, RNA, or DNA, and, if so, which residues bind. Per-protein predictions rely on homology and machine learning models employing profile-kernel SVMs (49) and embeddings from an in-house implementation of ProtVec (50). Per-residue predictions are based on simple neural networks due to the lack of experimental high-resolution annotations (51–53).…”
Section: Methods
Mentioning confidence: 99%
“…Since proteins do not have a well-defined vocabulary of words, word-level tokenization is not a well-defined option in the case of proteins. Subword segmentation, on the other hand, does not require any predefined knowledge of words in the target language, making it a potentially interesting approach for discovering “words” or motifs in proteins [107], [8], [12], [53].…”
Section: The Atomic Unit of Information: Tokenization
Mentioning confidence: 99%
“…In proteins, we have only ~20 AAs. While we can embed AAs onto a lower-dimensional space, it is not as clearly beneficial [8]. While dimensionality reduction is of limited use when working on single AAs, it can provide useful compact representations when considering extended AA combinations.…”
Section: Word Embeddings
Mentioning confidence: 99%
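The contrast drawn in the quote above, between the small single-AA alphabet and larger subword vocabularies, can be made concrete: a one-hot vector over the 20 standard amino acids is already compact, while a subword vocabulary can be far larger, making a dense lookup table with fewer dimensions worthwhile. The embedding values below are hypothetical placeholders, not trained vectors:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def one_hot(aa):
    """20-dimensional one-hot vector for a single amino acid."""
    vec = [0.0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(aa)] = 1.0
    return vec

# For a subword vocabulary (potentially thousands of entries), a dense
# low-dimensional lookup table replaces impractically wide one-hot vectors.
# Hypothetical 4-dimensional embeddings; real ones are learned from data.
embedding = {
    "MKT": [0.1, -0.3, 0.5, 0.0],
    "AK":  [0.4, 0.2, -0.1, 0.3],
}

print(len(one_hot("M")))      # 20 dimensions for a single AA
print(len(embedding["MKT"]))  # 4 dimensions for a subword token
```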