B cells are an essential component of the immune system and play a vital role in mounting an immune response against pathogenic infection by producing antibodies. Existing methods predict either linear or conformational B-cell epitopes in an antigen. In this study, a single method was developed for predicting both types (linear and conformational) of B-cell epitopes. The dataset used in this study contains 3875 B-cell epitopes and 3996 non-B-cell epitopes, where the B-cell epitopes include both linear and conformational epitopes. Our primary analysis indicates that certain residues (such as Asp, Glu, Lys, and Asn) are more prominent in B-cell epitopes. We developed machine-learning-based methods using different types of sequence composition and achieved the highest AUC of 0.80 using dipeptide composition. In addition, models were developed on selected features, but no further improvement was observed. Our similarity-based method, implemented using BLAST, shows a high probability of correct prediction but poor sensitivity. Finally, we developed a hybrid model that combines an alignment-free model (a dipeptide-based random forest) with an alignment-based model (BLAST-based similarity). Our hybrid model attained a maximum AUC of 0.83 with an MCC of 0.49 on the independent dataset, and it performs better than existing methods on the independent dataset used in this study. All models were trained and tested on 80% of the data using cross-validation, and the final model was evaluated on the remaining 20%, called the independent or validation dataset. A webserver and standalone package named "CLBTope" has been developed for predicting, designing, and scanning B-cell epitopes in an antigen sequence (https://webs.iiitd.edu.in/raghava/clbtope/).
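As an illustration of the dipeptide composition feature mentioned above, the following is a minimal sketch rather than the CLBTope implementation. It assumes the commonly used definition, in which each of the 400 dipeptides is expressed as a percentage of all dipeptides in the peptide; the function name `dipeptide_composition` and the handling of non-standard residues are hypothetical choices made for this example.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(sequence):
    """Return the percentage composition of all 400 dipeptides.

    Assumption: composition = 100 * count(dipeptide) / (number of valid
    overlapping dipeptides). Pairs containing a non-standard residue
    are skipped, which is a choice made for this sketch only.
    """
    seq = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short or made only of non-standard residues.
        return {pair: 0.0 for pair in counts}
    return {pair: 100.0 * n / total for pair, n in counts.items()}


if __name__ == "__main__":
    features = dipeptide_composition("DEKNDEKN")
    # Print only the dipeptides that actually occur in the peptide.
    print({pair: round(v, 2) for pair, v in features.items() if v > 0})
```

The resulting 400-dimensional vector is the kind of fixed-length input a random forest classifier can consume, regardless of peptide length.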
The global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape, increased transmissibility, or increased pathogenicity. Early prediction of the emergence of new strains with these features is critical for pandemic preparedness. We present Strainflow, a supervised and causally predictive model using unsupervised latent space features of SARS-CoV-2 genome sequences. Strainflow was trained and validated on 0.9 million sequences for the period December 2019 to June 2021, and the frozen model was prospectively validated from July 2021 to December 2021. Strainflow captured the rise in cases 2 months ahead of the Delta and Omicron surges in most countries, including the prediction of a surge in India as early as the beginning of November 2021. Entropy analysis of Strainflow's unsupervised embeddings clearly reveals the explore-exploit cycles in genomic feature space, thus adding interpretability to the deep-learning-based model. We also conducted a codon-level analysis of our model to assess the interpretability and biological validity of our unsupervised features. The Strainflow application is openly available as an interactive web application for prospective genomic surveillance of COVID-19 across the globe.
The global efforts to control COVID-19 are threatened by the rapid emergence of novel variants that may display undesirable characteristics such as immune escape or increased pathogenicity. The current approaches to genomic surveillance do not allow early prediction of emerging variations. Here, we derive Dimensions of Concern (DoCs) in the latent space of SARS-CoV-2 mutations and demonstrate their potential to provide a lead time for predicting the increase of new cases in 9 countries across the globe. We learned unsupervised word embeddings from 309,060 spike protein coding sequences deposited in the GISAID database until April 2021. We discovered that "blips" in the latent dimensions of the embeddings are associated with mutations. We modeled the temporal occurrence of blips and their relationship with the number of new cases in the following months for these countries. Certain dimensions demonstrated a consistent leading relationship between the occurrence of blips and the number of new cases in the following months and were therefore labeled as potential Dimensions of Concern. We validated the predictive importance of DoCs by performing Random Forest-based feature selection and modeling in a temporally split training, validation, and testing regime. Twelve dimensions reached statistical significance and yielded an R-squared of 37% for prediction of the number of new cases in the following month. Biological exploration of DoCs revealed that dimensions 3 and 12 capture the 3-mers CGG, ACG, and CAC, which are associated with the known variants L452R, K417T, and Q677H, respectively. Learning and tracking DoCs is extensible to related challenges such as pandemic preparedness, immune escape, pathogenicity modeling, and antimicrobial resistance.
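To make the idea of learning word embeddings from coding sequences more concrete, the sketch below shows one way a nucleotide sequence could be split into 3-mer "words" before being passed to an embedding model. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say whether the 3-mers are overlapping or codon-aligned, so the hypothetical `step` parameter covers both cases, and the function name `kmer_tokens` is invented for this example.

```python
def kmer_tokens(sequence, k=3, step=1):
    """Split a nucleotide sequence into k-mer tokens.

    step=1 yields overlapping k-mers; step=k yields non-overlapping,
    codon-like tokens. Which scheme the original study used is not
    stated in the abstract, so both are offered here as assumptions.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


if __name__ == "__main__":
    toy = "ACGGCAC"  # toy fragment, not a real spike sequence
    print(kmer_tokens(toy))           # overlapping 3-mers
    print(kmer_tokens(toy, step=3))   # codon-aligned 3-mers
```

Each tokenized sequence can then be treated as a sentence by any word-embedding method, with the latent dimensions of the resulting vectors tracked over time.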