N-linked glycosylation is one of the most common protein post-translational modifications (PTMs) in humans, in which a glycan is attached to an asparagine (N) residue of the protein. It is involved in most biological processes and is associated with various human diseases such as diabetes, cancer, coronavirus, influenza, and Alzheimer's. Accordingly, identifying N-linked glycosylation sites helps in understanding the system and mechanism of glycosylation. Because experimental identification of glycosylation sites is challenging, machine learning has become an important means of predicting them. This paper proposes a novel N-linked glycosylation predictor based on bagging positive-unlabeled (PU) learning and stacking ensemble machine learning (PUStackNGly). In PUStackNGly, comprehensive sequence-based and structure-based features are extracted using different feature descriptors. Ensemble-based feature selection is then employed to select the most significant and stable features. The ensemble bagging PU learning step selects reliable negative samples from the unlabeled samples using four supervised learning methods (support vector machine, random forest, logistic regression, and XGBoost). Stacking ensemble learning is then applied using four base classifiers: logistic regression, artificial neural network, random forest, and support vector machine. The experimental results show that PUStackNGly has promising predictive performance compared to supervised learning methods. Furthermore, PUStackNGly outperforms the existing N-linked glycosylation prediction tools on an independent dataset with 95.11% accuracy, 100% recall, 80.7% precision, 89.32% F1 score, 96.93% AUC, and 0.87 MCC.
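The step of selecting reliable negatives by bagging PU learning can be illustrated with a minimal sketch. It is not the PUStackNGly implementation: a nearest-centroid rule stands in for the four supervised learners named in the abstract, and the function names, the out-of-bag vote averaging, and the toy data are all assumptions made for illustration.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bagging_pu_scores(positives, unlabeled, n_rounds=50, seed=0):
    """Average out-of-bag 'looks positive' vote for each unlabeled sample.

    Assumption: a nearest-centroid rule replaces the real base learners.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    pos_c = centroid(positives)
    for _ in range(n_rounds):
        # Bootstrap a bag of unlabeled samples and treat it as pseudo-negative.
        bag = [rng.randrange(len(unlabeled)) for _ in range(len(positives))]
        in_bag = set(bag)
        neg_c = centroid([unlabeled[i] for i in bag])
        # Score only the samples that were left out of this bag.
        for i, x in enumerate(unlabeled):
            if i in in_bag:
                continue
            totals[i] += 1.0 if sq_dist(x, pos_c) < sq_dist(x, neg_c) else 0.0
            counts[i] += 1
    # Samples never scored get a neutral 0.5.
    return [t / c if c else 0.5 for t, c in zip(totals, counts)]


def reliable_negatives(positives, unlabeled, n_select, **kwargs):
    """Indices of the n_select unlabeled samples with the lowest scores."""
    scores = bagging_pu_scores(positives, unlabeled, **kwargs)
    order = sorted(range(len(unlabeled)), key=lambda i: scores[i])
    return order[:n_select]


# Toy 1-D data: positives cluster near 5; the unlabeled set hides two positives.
positives = [[5.0], [5.2], [4.8], [5.1]]
unlabeled = [[0.0], [0.1], [-0.2], [0.3], [5.0], [4.9], [0.2], [-0.1]]
selected = reliable_negatives(positives, unlabeled, n_select=6)
```

In this toy setting the samples near 0 are never closer to the positive centroid than to a pseudo-negative centroid, so they receive the lowest scores and are the ones kept as reliable negatives.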
O-glycosylation is a common type of protein post-translational modification (PTM) that is linked to several diseases and plays significant roles in many biological processes. Identifying O-glycosylation sites is important for understanding the mechanism of the O-glycosylation process. However, identifying PTM sites with laboratory experiments is time-consuming and costly. Thus, computational and artificial intelligence methods are becoming essential for predicting O-glycosylation sites. In this paper, we propose a new model to improve O-glycosylation site prediction using a transformer-based protein language model and machine learning. The dataset was collected and prepared from a recent data source called OGP (O-glycoprotein repository). The TAPE (Tasks Assessing Protein Embeddings) protein language model was used to extract features from the peptide sequences as embeddings. Feature selection was then performed using a linear support vector machine (SVM) to select informative features. The XGBoost ensemble-based machine learning method was used for classification and prediction. Compared with traditional machine learning methods, the proposed model achieved high performance, with 0.7761 accuracy, 0.7391 sensitivity, 0.8130 specificity, 0.8295 AUC, and 0.5537 MCC. On an independent dataset, the proposed method performed better than the latest available methods for predicting O-glycosylation sites.
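For readers less familiar with the reported metrics, the following sketch shows how accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC) follow from a binary confusion matrix. The function name and the example counts are hypothetical; they are not taken from the paper's data. AUC is omitted because it needs ranked scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # recall on the positive (glycosylated) class
    specificity = tn / (tn + fp)   # recall on the negative class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=8, fp=1, tn=9, fn=2)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is commonly reported alongside sensitivity and specificity for site prediction tasks.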
Post-translational glycosylation and glycation are common types of protein post-translational modifications (PTMs) in which a glycan binds to a protein enzymatically or nonenzymatically, respectively. They are associated with various diseases such as coronavirus, Alzheimer’s, cancer, and diabetes. Identifying glycosylation and glycation sites is important for understanding their biological mechanisms. However, using experimental laboratory tools to identify PTM sites is time-consuming and costly. In contrast, computational methods based on machine learning are becoming increasingly essential for PTM site prediction due to their higher performance and lower cost. In recent years, advances in deep learning Transformer-based language models have been transferred from Natural Language Processing (NLP) to proteomics through language models for protein sequence representation, known as Protein Language Models (PLMs). In this work, we propose a novel method, PTG-PLM, for improving the performance of PTM glycosylation and glycation site prediction. PTG-PLM is based on convolutional neural networks (CNNs) and embeddings extracted from six recent PLMs: ProtBert-BFD, ProtBert, ProtAlbert, ProtXlnet, ESM-1b, and TAPE. The model is trained and evaluated on two public datasets for glycosylation and glycation site prediction. The results show that PTG-PLM based on ESM-1b and ProtBert-BFD performs better than PTG-PLM based on the other PLMs. Comparisons with existing tools and representative supervised learning methods show that PTG-PLM surpasses the other models for glycosylation and glycation site prediction. The strong performance of PTG-PLM indicates that it could also be used to predict the sites of other types of PTMs.
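To make the combination of CNNs and PLM embeddings concrete, here is a minimal sketch of a single 1D convolution filter sliding over a per-residue embedding matrix, followed by ReLU and global max pooling. The architecture of PTG-PLM is not described in the abstract, so the function name, the filter weights, and the tiny 2-dimensional "embeddings" are all hypothetical.

```python
def conv1d_relu_maxpool(embeddings, kernel, bias=0.0):
    """Apply one 1D convolution filter along the sequence axis.

    embeddings: list of per-residue vectors (length L, each of dimension D)
    kernel:     list of k weight vectors (each of dimension D)
    Returns the maximum ReLU activation over all window positions.
    """
    k = len(kernel)
    activations = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        total = bias + sum(
            w * x
            for w_row, x_row in zip(kernel, window)
            for w, x in zip(w_row, x_row)
        )
        activations.append(max(0.0, total))  # ReLU
    return max(activations)                  # global max pooling


# Hypothetical 4-residue peptide with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # one filter spanning 2 residues
feature = conv1d_relu_maxpool(embeddings, kernel)
```

A real model would learn many such filters and feed the pooled features into dense layers; real PLM embeddings have hundreds to more than a thousand dimensions per residue.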
Glycans are important biological molecules that can be found on their own or attached to other molecules. They have complex, branching structures rather than linear ones. Glycans are crucial for many biological processes and are involved in the development of several important diseases. Because of the complexity and branched structure of glycans, most current studies have focused on the molecules they are attached to rather than on the glycans themselves. This paper proposes GNNGLY, a graph neural network model for glycan classification. First, glycans are represented as molecular graphs, where atoms are represented as nodes and bonds as edges. Graph convolutional networks (GCNs) are then used to make predictions at eight taxonomic classification levels and for immunogenicity. The results indicate that GNNGLY outperforms traditional machine learning methods and performs competitively with other existing tools for glycan classification. GNNGLY could have a significant impact on the field of glycoinformatics and related research areas.
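The graph representation can be illustrated with one graph convolution step on a tiny graph. This sketch uses the widely used normalized propagation rule, ReLU(D^(-1/2) (A + I) D^(-1/2) H W); whether GNNGLY uses exactly this variant is an assumption, and the function name, graph, features, and weights are made up for illustration.

```python
import math


def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    adj:    n x n adjacency matrix (bonds between atoms)
    feats:  n x f node feature matrix H (one row per atom)
    weight: f x o weight matrix W
    """
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    n_feats = len(feats[0])
    # Aggregate neighbor features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n))
            for f in range(n_feats)] for i in range(n)]
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(n_feats)))
             for o in range(len(weight[0]))] for i in range(n)]


# Hypothetical two-atom graph joined by one bond, with scalar node features.
adj = [[0.0, 1.0], [1.0, 0.0]]
feats = [[1.0], [3.0]]
weight = [[1.0]]
hidden = gcn_layer(adj, feats, weight)
```

Stacking several such layers and pooling the node vectors into one graph-level vector gives a representation that a classifier can map to a taxonomic level or an immunogenicity label.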