“…Also, deep language models such as BERT 91 and ELMo 46 were originally developed for NLP and later employed for protein representations 23,28 . Furthermore, Convolutional Neural Networks (CNNs), which can learn to summarize the data with adaptive filters, have been employed to represent proteins 23,63,86,102,103 . Additionally, architectures capable of inferring patterns from sequential data (e.g., protein sequences), including Long Short-Term Memory (LSTM) neural networks 23,28,44,104,105 and attention-based 23,55 transformer algorithms 106 , are used in representation methods.…”
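To make the CNN idea above concrete, the sketch below shows how adaptive filters can summarize a variable-length protein sequence into a fixed-length vector: residues are one-hot encoded, each filter is slid along the sequence, and global max pooling collapses the positional axis. This is a minimal illustration, not any of the cited methods; the filters here are random rather than learned, and the sequence `"MKTAYIAKQR"` is an arbitrary example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """Encode a protein sequence as a (length, 20) one-hot matrix."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x

def conv1d_embed(x, filters, kernel_size=3):
    """Slide each filter over the sequence, then max-pool over positions.

    Returns a vector with one entry per filter, so the output size is
    independent of the input sequence length.
    """
    n_pos = x.shape[0] - kernel_size + 1
    # Feature map: one response per (filter, sequence position) pair.
    fmap = np.array([
        [np.sum(x[p:p + kernel_size] * f) for p in range(n_pos)]
        for f in filters
    ])
    return fmap.max(axis=1)  # global max pooling

# Random (untrained) filters stand in for learned ones in this sketch.
rng = np.random.default_rng(0)
n_filters, kernel_size = 8, 3
filters = rng.standard_normal((n_filters, kernel_size, len(AMINO_ACIDS)))

emb = conv1d_embed(one_hot("MKTAYIAKQR"), filters, kernel_size)
print(emb.shape)  # (8,): one value per filter, regardless of sequence length
```

In a trained model the filters would be fitted by backpropagation so that each one responds to an informative local motif; the pooling step is what makes the representation usable for proteins of differing lengths.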