To better understand the molecular basis of respiratory diseases of viral origin, high-throughput gene-expression data are frequently collected using DNA microarray or RNA-seq technology. Such data can also be used to classify infected individuals by molecular signatures in the form of machine-learning models with genes as predictor variables. Early diagnosis of patients by molecular signatures could also contribute to better treatments. An approach that has rarely been considered for machine-learning models in the context of transcriptomics is data augmentation. For other data types, it has been shown that augmentation can improve classification accuracy and prevent overfitting. Here, we compare three strategies for data augmentation of DNA microarray and RNA-seq data from two selected studies on respiratory diseases of viral origin. The first study involves samples from patients whose respiratory disease is of either viral or bacterial origin; the second study involves patients with either SARS-CoV-2 or another respiratory virus as the disease origin. Specifically, we reanalyze these public datasets to study whether patient classification by transcriptomic signatures can be improved by adding artificial data when training the machine-learning models. Our comparison reveals that augmentation of transcriptomic data can improve classification accuracy and that fewer genes are necessary as explanatory variables in the final models. We also report genes from our signatures that overlap with signatures presented in the original publications of our example data. Because of the strict selection criteria, this overlap underlines the molecular role of these genes in respiratory infectious diseases.
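The abstract does not name the three augmentation strategies, so the following is only a minimal sketch of one generic possibility: adding Gaussian noise to training samples while keeping their class labels. The function name `augment_with_noise`, the noise level, and the toy expression values are assumptions made for illustration and are not taken from the study.

```python
import random


def augment_with_noise(samples, labels, n_new_per_sample=2, noise_sd=0.1, seed=0):
    """Return the original samples plus noisy copies that keep the class label.

    Assumption: expression values are on a log-like scale, so additive
    Gaussian noise is a plausible perturbation. This is a generic technique,
    not necessarily one of the strategies compared in the paper.
    """
    rng = random.Random(seed)
    aug_samples = [list(s) for s in samples]
    aug_labels = list(labels)
    for sample, label in zip(samples, labels):
        for _ in range(n_new_per_sample):
            # Perturb every gene independently; the label is inherited.
            noisy = [x + rng.gauss(0.0, noise_sd) for x in sample]
            aug_samples.append(noisy)
            aug_labels.append(label)
    return aug_samples, aug_labels


# Toy training set: 4 patients x 3 genes (made-up values).
expr = [[5.1, 7.3, 2.0], [4.9, 7.0, 2.2], [8.2, 3.1, 6.5], [8.0, 3.3, 6.8]]
cls = ["viral", "viral", "bacterial", "bacterial"]

aug_expr, aug_cls = augment_with_noise(expr, cls)
print(len(expr), "original samples ->", len(aug_expr), "training samples")
```

In such a setup, artificial samples would be added to the training set only, so that the classifier is still evaluated on real patient data.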
Introduction: Naturally attenuated Langat virus (LGTV) and highly pathogenic tick-borne encephalitis virus (TBEV) share antigenically similar viral proteins and are grouped together in the same flavivirus serocomplex. In the early 1970s, this encouraged the use of LGTV as a potential live attenuated vaccine against tick-borne encephalitis (TBE) until cases of encephalitis were reported among vaccinees. Previously, we have shown in a mouse model that immunity induced against LGTV protects mice against lethal TBEV challenge infection. However, the immune correlates of this protection have not been studied. Methods: To identify correlates of protection, we used adoptive transfer of either serum or T cells from LGTV-infected mice into naïve recipient mice and challenged them with a lethal dose of TBEV. Results: We show that infection of mice with LGTV induced both cross-reactive antibodies and T cells against TBEV. Monitoring disease progression in the recipient mice for 16 days post infection showed that serum from LGTV-infected mice efficiently protected them from developing severe disease. In contrast, adoptive transfer of T cells from LGTV-infected mice failed to provide protection. Histopathological investigation of infected brains suggested a possible role of microglia and T cells in inflammatory processes within the brain. Discussion: Our data provide key information regarding the immune correlates of protection induced by LGTV infection of mice, which may help design better vaccines against TBEV.
Background: Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step for comparative metagenomics. For that purpose, sequencing reads are usually classified by mapping them against a database of known viral reference genomes. This fails, however, to classify reads from novel viruses and quasispecies whose reference sequences are not yet available in public databases. Methods: In order to circumvent the problem of a mapping approach with unknown viruses, the feasibility and performance of neural networks to classify sequencing reads to taxonomic classes is studied. For that purpose, taxonomy and genome data from the NCBI database are used to sample artificial reads from known viruses with known taxonomic attribution. Based on these training data, artificial neural networks are fitted and applied to classify single viral read sequences to different taxa. Model building includes different input features derived from artificial read sequences as possible predictors which are chosen by a feature selection method. Training, validation and test data are computed from these input features. To summarise classification results, a generalised confusion matrix is proposed which lists all possible misclassification combination frequencies. Two new formulas to statistically estimate taxa frequencies are introduced for studying the overall viral composition. Results: We found that the best taxonomic level supported by the NCBI database is that of viral orders. Prediction accuracy of the fitted models is evaluated on test data and classification results are summarised in a confusion matrix, from which diagnostic measures such as sensitivity and specificity as well as positive and negative predictive values are calculated.
The prediction accuracy of the artificial neural net is considerably higher than that of random classification, and the posterior estimation of taxa frequencies is closer to the true distribution in the training data than simple classification or mapping results. Conclusions: Neural networks are helpful to classify sequencing reads into viral orders and can be used to complement the results of mapping approaches. The machine learning approach is not limited to already known viruses. In addition, statistical estimations of taxa frequencies can be used for subsequent comparative metagenomics.
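As an illustration of how diagnostic measures can be read off a multi-class confusion matrix, the following sketch treats each class in turn as "positive" and all other classes as "negative" (one-vs-rest). The abstract does not spell out how its generalised confusion matrix is evaluated, so this reduction, the function name `one_vs_rest_measures`, the class names, and the counts are assumptions for illustration only.

```python
def one_vs_rest_measures(confusion, classes):
    """Compute per-class diagnostic measures from a square confusion matrix.

    confusion[i][j] = number of reads of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    nan = float("nan")
    measures = {}
    for k, name in enumerate(classes):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # true class k, predicted otherwise
        fp = sum(row[k] for row in confusion) - tp   # predicted k, truly another class
        tn = total - tp - fn - fp
        measures[name] = {
            "sensitivity": tp / (tp + fn) if tp + fn else nan,
            "specificity": tn / (tn + fp) if tn + fp else nan,
            "ppv": tp / (tp + fp) if tp + fp else nan,
            "npv": tn / (tn + fn) if tn + fn else nan,
        }
    return measures


# Toy confusion matrix for three hypothetical viral orders (made-up counts).
classes = ["order_A", "order_B", "order_C"]
confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

measures = one_vs_rest_measures(confusion, classes)
for name in classes:
    print(name, measures[name])
```

For `order_A` in this toy matrix, 8 of 10 true reads are recovered (sensitivity 0.8) and 18 of 20 reads from other orders are correctly rejected (specificity 0.9).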
Outliers in the training or test set used to fit and evaluate a classifier on transcriptomics data can considerably change the estimated performance of the model. As a result, the reported accuracy is either too pessimistic or too optimistic, and the estimated model performance cannot be reproduced on independent data. It is then also doubtful whether a classifier qualifies for clinical use. We estimate classifier performance in simulated gene expression data with artificial outliers and in two real-world datasets. As a new approach, we use two outlier detection methods within a bootstrap procedure to estimate the outlier probability for each sample, and we evaluate classifiers before and after outlier removal by means of cross-validation. We found that the removal of outliers changed the classification performance notably. For the most part, removing outliers improved the classification results. Because there are various, sometimes unclear reasons for a sample to be an outlier, we strongly advocate always reporting the performance of a transcriptomics classifier with and without outliers in training and test data. This provides a more complete picture of a classifier's performance and prevents reporting models that later turn out not to be applicable for clinical diagnosis.
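The idea of a bootstrap-based outlier probability can be sketched as follows. The abstract does not name its two outlier detection methods, so this example substitutes a single simple rule (distance to the coordinate-wise median, flagged when it exceeds the median distance plus a multiple of the MAD) as a stand-in. The function name, the cutoff, and the toy data are assumptions for illustration.

```python
import random
import statistics


def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def bootstrap_outlier_probability(samples, n_boot=200, cutoff=3.0, seed=0):
    """Fraction of bootstrap rounds in which each sample is flagged as an outlier.

    In every round, a reference (centre and distance threshold) is estimated
    from a resample drawn with replacement, and all original samples are
    compared against it. The detection rule is a stand-in assumption.
    """
    rng = random.Random(seed)
    n = len(samples)
    flagged = [0] * n
    for _ in range(n_boot):
        boot = [samples[rng.randrange(n)] for _ in range(n)]
        center = [statistics.median(col) for col in zip(*boot)]
        boot_dist = [distance(s, center) for s in boot]
        med = statistics.median(boot_dist)
        mad = statistics.median([abs(d - med) for d in boot_dist])
        threshold = med + cutoff * mad
        for i, s in enumerate(samples):
            if distance(s, center) > threshold:
                flagged[i] += 1
    return [f / n_boot for f in flagged]


# Toy data: 7 similar samples and 1 clearly deviating sample (2 genes, made up).
samples = [
    [1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.2, 2.2],
    [0.8, 1.8], [1.05, 1.95], [0.95, 2.05], [6.0, 9.0],
]

probs = bootstrap_outlier_probability(samples)
for i, p in enumerate(probs):
    print(f"sample {i}: estimated outlier probability {p:.2f}")
```

Samples whose estimated probability exceeds a chosen limit could then be removed before refitting, and classifier performance compared with and without them, as the abstract recommends.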
Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step in comparative metagenomics. Mapping sequencing reads against a database of known viral reference genomes, however, fails to classify reads from novel viruses whose reference sequences are not yet available in public databases. As an alternative to mapping, and in order to classify sequencing reads at least to a taxonomic level, the performance of artificial neural networks and other machine learning models was studied. Taxonomic and genomic data from the NCBI database were used to sample labelled sequencing reads as training data. The fitted neural network was applied to classify unlabelled reads of simulated and real-world test sets. Additional auxiliary test sets of labelled reads were used to estimate the conditional class probabilities and to correct the prior estimation of the taxonomic distribution in the actual test set. Among the taxonomic levels, the biological order of viruses provided the most comprehensive database for generating training data. The prediction accuracy of the artificial neural network in classifying test reads to their viral order was considerably higher than that of a random classification. Posterior estimation of taxa frequencies could correct the primary classification results.
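One way to picture the correction of a predicted class distribution by conditional class probabilities is the following sketch. It is not the estimator from the paper, whose formulas are not given in the abstract. Instead it uses a standard expectation-maximisation update that searches for true class frequencies consistent with the observed predicted frequencies, given P(predicted class | true class) estimated from labelled auxiliary data. All names and numbers are assumptions for illustration.

```python
def correct_class_frequencies(predicted_freq, cond_prob, n_iter=200):
    """Estimate true class frequencies from predicted class frequencies.

    predicted_freq[j] = fraction of test reads predicted as class j.
    cond_prob[i][j]   = P(predicted class j | true class i), e.g. estimated
                        from a labelled auxiliary test set.
    Uses an EM-style multiplicative update (an assumed stand-in method).
    """
    k = len(cond_prob)
    est = [1.0 / k] * k  # start from a uniform distribution
    for _ in range(n_iter):
        # Predicted-class distribution implied by the current estimate.
        mix = [sum(est[i] * cond_prob[i][j] for i in range(k)) for j in range(k)]
        # Reweight each true class by how well it explains the observed predictions.
        est = [
            est[i] * sum(
                predicted_freq[j] * cond_prob[i][j] / mix[j]
                for j in range(k) if mix[j] > 0
            )
            for i in range(k)
        ]
    return est


# Toy example with two classes (made-up values).
cond_prob = [
    [0.9, 0.1],  # true class 0 is predicted correctly 90% of the time
    [0.2, 0.8],  # true class 1 is predicted correctly 80% of the time
]
true_freq = [0.3, 0.7]
# Predicted frequencies that a classifier with these error rates would produce.
predicted_freq = [
    sum(true_freq[i] * cond_prob[i][j] for i in range(2)) for j in range(2)
]

corrected = correct_class_frequencies(predicted_freq, cond_prob)
print("predicted:", [round(x, 3) for x in predicted_freq])
print("corrected:", [round(x, 3) for x in corrected])
```

In this toy case the raw predicted frequencies (0.41 and 0.59) are biased by misclassification, while the corrected estimate moves back towards the assumed true frequencies (0.3 and 0.7).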