Motivation The interactions between proteins and other molecules are essential to many biological and cellular processes. Experimental identification of interface residues is a time-consuming, costly, and challenging task, while protein sequence data is ubiquitous. Consequently, many computational and machine learning approaches have been developed over the years to predict such interface residues from sequence. However, the effectiveness of different deep learning architectures and learning strategies for protein–protein, protein–nucleotide, and protein–small molecule interface prediction, has not yet been investigated in great detail. Therefore, we here explore the prediction of protein interface residues using six deep learning architectures and various learning strategies with sequence-derived input features. Results We constructed a large data set dubbed BioDL, comprising protein–protein interactions from the PDB, and DNA/RNA and small molecule interactions from the BioLip database. We also constructed six DL architectures, and evaluated them on the BioDL benchmarks. This shows that no single architecture performs best on all instances. An ensemble architecture, which combines all six architectures, does consistently achieve peak prediction accuracy. We confirmed these results on the published benchmark set by Zhang & Kurgan (ZK448), and on our own existing curated homo- and heteromeric protein interaction data set. Our PIPENN sequence-based ensemble predictor outperforms current state-of-the-art sequence-based protein interface predictors on ZK448 on all interaction types, achieving an AUC-ROC of 0.718 for protein–protein, 0.823 for protein–nucleotide and 0.842 for protein–small molecule. Availability Source code and data sets at https://github.com/ibivu/pipenn/
Motivation Antibodies play an important role in clinical research and biotechnology, with their specificity determined by the interaction with the antigen’s epitope region, as a special type of protein-protein interaction (PPI) interface. The ubiquitous availability of sequence data, allows us to predict epitopes from sequence in order to focus time-consuming wet-lab experiments towards the most promising epitope regions. Here, we extend our previously developed sequence-based predictors for homodimer and heterodimer PPI interfaces to predict epitope residues that have the potential to bind an antibody. Results We collected and curated a high quality epitope dataset from the SAbDab database. Our generic PPI heterodimer predictor obtained an AUC-ROC of 0.666 when evaluated on the epitope test set. We then trained a random forest model specifically on the epitope dataset, reaching AUC 0.694. Further training on the combined heterodimer and epitope datasets, improves our final predictor to AUC 0.703 on the epitope test set. This is better than the best state-of-the-art sequence-based epitope predictor BepiPred-2.0. On one solved antibody-antigen structure of the COVID19 virus spike RNA binding domain, our predictor reaches AUC 0.778. We added the SeRenDIP-CE Conformational Epitope predictors to our webserver, which is simple to use and only requires a single antigen sequence as input, which will help make the method immediately applicable in a wide range of biomedical and biomolecular research. Availability Webserver, source code and datasets at www.ibi.vu.nl/programs/serendipwww/
General rightsCopyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.• Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal Take down policyIf you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.
Motivation: Protein interactions play an essential role in many biological and cellular processes, such as protein—protein interaction (PPI) in signaling pathways, binding to DNA in transcription, and binding to small molecules in receptor activation or enzymatic activity. Experimental identification of protein binding interface residues is a time-consuming, costly, and challenging task. Several machine learning and other computational approaches exist which predict such interface residues. Here we explore if Deep Learning (DL) can be used effectively for this prediction task, and which learning strategies and architectures may be most efficient. We introduce seven DL architectures that are applied to eleven independent test sets, focused on the residues involved in PPI interfaces and in binding RNA/DNA and small molecule ligands. Results: We constructed a large data set dubbed BioDL, comprising protein-protein interaction data from the PDB and protein-ligand interactions (DNA, RNA and small molecules) from the BioLip database. Additionally, we reused our existing curated homo- and heteromeric PPI data sets. We performed several experiments to assess the impact of different data features, spatial forms, encoding schemes, network initializations, loss functions, regularization mechanisms, and activation functions on the performance of the predictors. Benchmarking the resulting DL models with an independent test set (ZK448) shows no single DL architecture performs best on all instances, but that an ensemble of DL architectures consistently achieves peak prediction performance. Our PIPENN's ensemble predictor outperforms current state-of-the-art sequence-based protein interface predictors on all interaction types, achieving AUCs of 0.718 (protein—protein), 0.823 (protein—nucleotide) and 0.842 (protein—small molecule) respectively. Availability: Source code and data sets at https://github.com/ibivu/
The availability of high-throughput molecular profiling techniques has provided more accurate and informative data for regular clinical studies. Nevertheless, complex computational workflows are required to interpret these data. Over the past years, the data volume has been growing explosively, requiring robust human data management to organise and integrate the data efficiently. For this reason, we set up an ELIXIR implementation study, together with the Translational research IT (TraIT) programme, to design a data ecosystem that is able to link raw and interpreted data. In this project, the data from the TraIT Cell Line Use Case (TraIT-CLUC) are used as a test case for this system. Within this ecosystem, we use the European Genome-phenome Archive (EGA) to store raw molecular profiling data; tranSMART to collect interpreted molecular profiling data and clinical data for corresponding samples; and Galaxy to store, run and manage the computational workflows. We can integrate these data by linking their repositories systematically. To showcase our design, we have structured the TraIT-CLUC data, which contain a variety of molecular profiling data types, for storage in both tranSMART and EGA. The metadata provided allows referencing between tranSMART and EGA, fulfilling the cycle of data submission and discovery; we have also designed a data flow from EGA to Galaxy, enabling reanalysis of the raw data in Galaxy. In this way, users can select patient cohorts in tranSMART, trace them back to the raw data and perform (re)analysis in Galaxy. Our conclusion is that the majority of metadata does not necessarily need to be stored (redundantly) in both databases, but that instead FAIR persistent identifiers should be available for well-defined data ontology levels: study, data access committee, physical sample, data sample and raw data file. This approach will pave the way for the stable linkage and reuse of data.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.