Glycosyltransferases (GTs) are prevalent across the tree of life and regulate nearly all aspects of cellular functions. The evolutionary basis for their complex and diverse modes of catalytic functions remain enigmatic. Here, based on deep mining of over half million GT-A fold sequences, we define a minimal core component shared among functionally diverse enzymes. We find that variations in the common core and emergence of hypervariable loops extending from the core contributed to GT-A diversity. We provide a phylogenetic framework relating diverse GT-A fold families for the first time and show that inverting and retaining mechanisms emerged multiple times independently during evolution. Using evolutionary information encoded in primary sequences, we trained a machine learning classifier to predict donor specificity with nearly 90% accuracy and deployed it for the annotation of understudied GTs. Our studies provide an evolutionary framework for investigating complex relationships connecting GT-A fold sequence, structure, function and regulation.
This article presents a high-sensitivity, quantified, linear, and mediator-free resonator-based microwave biosensor for glucose sensing application. The proposed biosensor comprises an air-bridge-type asymmetrical differential inductor (L) and a center-loaded circular finger-based inter-digital capacitor (C) fabricated on Gallium Arsenide (GaAs) substrate using advanced micro-fabrication technology. The intertwined asymmetrical differential inductor is used to achieve a high inductance value with a suitable Q-factor, and the centralized inter-digital capacitor is introduced to generate an intensified electric field. The designed microwave sensor is optimized to operate at a low resonating frequency that increases the electric field penetration depth and interaction area in the glucose sample. The microwave biosensor is tested with different glucose concentrations (0.3–5 mg/ml), under different ambient temperatures (10–50 °C). The involvement of advanced micro-fabrication technology effectively miniaturized the microwave biosensor (0.006λ0 × 0.005λ0) and enhanced its filling factor. The proposed microwave biosensor demonstrates a high sensitivity of 117.5 MHz/mgmL-1 with a linear response (r2 = 0.9987), good amplitude variation of 0.49 dB/mgmL-1 with a linear response (r2 = 0.9954), and maximum reproducibility of 0.78% at 2 mg/mL. Additionally, mathematical modelling was performed to estimate the dielectric value of the frequency-dependent glucose sample. The measured and analyzed results indicate that the proposed biosensor is suitable for real-time blood glucose detection measurements.
Glycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through the biosynthesis of complex carbohydrates and glycosylation of diverse protein and small molecule substrates. The extensive structural and functional diversification of GTs presents a major challenge in mapping the relationships connecting sequence, structure, fold and function using traditional bioinformatics approaches. Here, we present a convolutional neural network with attention (CNN-attention) based deep learning model that leverages simple secondary structure representations generated from primary sequences to provide GT fold prediction with high accuracy. The model learns distinguishing secondary structure features free of primary sequence alignment constraints and is highly interpretable. It delineates sequence and structural features characteristic of individual fold types, while classifying them into distinct clusters that group evolutionarily divergent families based on shared secondary structural features. We further extend our model to classify GT families of unknown folds and variants of known folds. By identifying families that are likely to adopt novel folds such as GT91, GT96 and GT97, our studies expand the GT fold landscape and prioritize targets for future structural studies.
Protein prenylation by farnesyltransferase (FTase) is often described as the targeting of a cysteine-containing motif (CaaX) that is enriched for aliphatic amino acids at the a1 and a2 positions, while quite flexible at the X position. Prenylation prediction methods often rely on these features despite emerging evidence that FTase has broader target specificity than previously considered. Using a machine learning approach and training sets based on canonical (prenylated, proteolyzed, and carboxymethylated) and recently identified shunted motifs (prenylation only), this study aims to improve prenylation predictions with the goal of determining the full scope of prenylation potential among the 8000 possible Cxxx sequence combinations. Further, this study aims to subdivide the prenylated sequences as either shunted (i.e., uncleaved) or cleaved (i.e., canonical). Predictions were determined for Saccharomyces cerevisiae FTase and compared to results derived using currently available prenylation prediction methods. In silico predictions were further evaluated using in vivo methods coupled to two yeast reporters, the yeast mating pheromone a-factor and Hsp40 Ydj1p, that represent proteins with canonical and shunted CaaX motifs, respectively. Our machine learning-based approach expands the repertoire of predicted FTase targets and provides a framework for functional classification.
Motivation The human genome encodes over 500 distinct protein kinases which regulate nearly all cellular processes by the specific phosphorylation of protein substrates. While advances in mass spectrometry and proteomics studies have identified thousands of phosphorylation sites across species, information on the specific kinases that phosphorylate these sites is currently lacking for the vast majority of phosphosites. Recently, there has been a major focus on the development of computational models for predicting kinase-substrate associations. However, most current models only allow predictions on a subset of well-studied kinases. Furthermore, the utilization of hand-curated features and imbalances in training and testing datasets pose unique challenges in the development of accurate predictive models for kinase-specific phosphorylation prediction. Motivated by the recent development of universal protein language models which automatically generate context-aware features from primary sequence information, we sought to develop a unified framework for kinase-specific phosphosite prediction, allowing for greater investigative utility and enabling substrate predictions at the whole kinome level. Results We present a deep learning model for kinase-specific phosphosite prediction, termed Phosformer, which predicts the probability of phosphorylation given an arbitrary pair of unaligned kinase and substrate peptide sequences. We demonstrate that Phosformer implicitly learns evolutionary and functional features during training, removing the need for feature curation and engineering. Further analyses reveal that Phosformer also learns substrate specificity motifs and is able to distinguish between functionally-distinct kinase families. Benchmarks indicate that Phosformer exhibits significant improvements compared to the state-of-the-art models, while also presenting a more generalized, unified, and interpretable predictive framework. Availability Code and data are available at https://github.com/esbgkannan/phosformer. Supplementary information Supplementary data are available at Bioinformatics online.
Protein language modeling is a fast-emerging deep learning method in bioinformatics with diverse applications such as structure prediction and protein design. However, application toward estimating sequence conservation for functional site prediction has not been systematically explored. Here, we present a method for the alignment-free estimation of sequence conservation using sequence embeddings generated from protein language models. Comprehensive benchmarks across publicly available protein language models reveal that ESM2 models provide the best performance to computational cost ratio for conservation estimation. Applying our method to full-length protein sequences, we demonstrate that embedding-based methods are not sensitive to the order of conserved elements—conservation scores can be calculated for multidomain proteins in a single run, without the need to separate individual domains. Our method can also identify conserved functional sites within fast-evolving sequence regions (such as domain inserts), which we demonstrate through the identification of conserved phosphorylation motifs in variable insert segments in protein kinases. Overall, embedding-based conservation analysis is a broadly applicable method for identifying potential functional sites in any full-length protein sequence and estimating conservation in an alignment-free manner. To run this on your protein sequence of interest, try our scripts at https://github.com/esbgkannan/kibby.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.