Motivation: In silico identification of linear B-cell epitopes represents an important step in the development of diagnostic tests and vaccine candidates, by providing potential high-probability targets for experimental investigation. Current predictive tools were developed under a generalist approach, in which models are trained on heterogeneous data sets to produce predictors that can be deployed for a wide variety of pathogens. However, continuous advances in processing power and the increasing amount of epitope data for a broad range of pathogens indicate that training organism- or taxon-specific models may become a feasible alternative, with unexplored potential gains in predictive performance. Results: This paper shows how organism-specific training of epitope prediction models can yield substantial performance gains across several quality metrics when compared to models trained with heterogeneous and hybrid data, and to a variety of widely used predictors from the literature. These results suggest a promising alternative for the development of custom-tailored predictive models with high predictive power, which can be easily implemented and deployed for the investigation of specific pathogens. Availability: The data underlying this article, as well as the full reproducibility scripts, are available at https://github.com/fcampelo/OrgSpec-paper. The R package that implements the organism-specific pipeline functions is available at https://github.com/fcampelo/epitopes. Supplementary information: Supplementary materials are available at Bioinformatics online.
Background: The identification of linear B-cell epitopes remains an important task in the development of vaccines, therapeutic antibodies and several diagnostic tests. Machine learning predictors are trained to flag potential epitope candidates for experimental validation. Currently, most predictors are trained as generalist models using large, heterogeneous data sets. Recently, organism-specific training has been shown to improve prediction performance for data-rich organisms. Unfortunately, for most organisms, large volumes of validated epitope data are not yet available. This article investigates the limits of organism-specific training for epitope prediction. It explores the validity of organism-specific training for data-poor organisms by examining how the size of the training data set affects prediction performance. It also compares the performance of organism-specific training under simulated data-poor conditions to that of models trained using traditional large heterogeneous and hybrid data sets. Results: This work shows how models trained on small organism-specific data sets can outperform similar models trained on (potentially much larger) heterogeneous and mixed data sets. The results reported indicate that as few as 20 labelled peptides from a given pathogen can be sufficient to generate models that outperform widely used predictors from the literature, which are trained on heterogeneous data. Models trained using more than about 100 to 150 organism-specific peptides perform consistently better than most generalist models across a wide variety of performance measures, and in some cases can even approach the performance of organism-specific models trained on considerably larger data sets. Conclusions: Organism-specific training improves linear B-cell epitope prediction performance even in situations where only small training sets are available, which opens new possibilities for the development of bespoke, high-performance predictive models when studying data-poor organisms such as emerging or neglected pathogens.
Monkeypox is a disease caused by the Monkeypox virus (MPXV), a double-stranded DNA virus of the genus Orthopoxvirus in the family Poxviridae, which has recently emerged as a global health threat after decades of local outbreaks in Central and Western Africa. Effective epidemiological control of this disease requires the development of cheaper, faster diagnostic tools to monitor its spread, including antigen and serological testing. There is, however, little available information about MPXV epitopes, particularly those that would be effective in discriminating between MPXV infections and infections by other viruses from the same family. We used the available data from the Immune Epitope Database (IEDB) to generate and validate a predictive model optimised for detecting linear B-cell epitopes (LBCEs) from Orthopoxvirus, based on a phylogeny-aware data selection strategy. By coupling this predictive approach with conservation and similarity analyses, we identified nine specific peptides from MPXV that are likely to represent distinctive LBCEs for the diagnosis of Monkeypox infections, including the independent detection of a known epitope experimentally characterised as a potential specific diagnostic target for MPXV. The results obtained indicate the ability of the proposed pipeline to uncover promising targets for the development of cheaper, more specific diagnostic tests for this emerging viral disease.