One reason for the rapid spread of SARS-CoV-2 is the limited accuracy of the detection tools available in the clinical field. Molecular techniques, such as quantitative real-time RT-PCR and nucleic acid sequencing methods, are widely used to identify pathogens. For this particular virus, however, their overall detection rate is unsatisfactory, owing to its relatively recent emergence and to features that are still not completely understood. In addition, SARS-CoV-2 is remarkably similar to other coronaviruses, and it can present alongside other respiratory infections, making identification even harder. To tackle this issue, we propose an assisted detection test that combines molecular testing with deep learning. The proposed approach employs a state-of-the-art deep convolutional neural network, able to automatically create features starting from the genome sequence of the virus. Experiments on data from the Novel Coronavirus Resource (2019nCoVR) show that the proposed approach is able to correctly classify SARS-CoV-2, distinguishing it from other coronavirus strains, such as MERS-CoV, HCoV-NL63, HCoV-OC43, HCoV-229E, HCoV-HKU1, and SARS-CoV, regardless of missing information and errors in sequencing (noise). From a dataset of 553 complete, non-repeated genome sequences that vary from 1,260 to 31,029 bps in length, the proposed approach classifies the different coronaviruses with an average accuracy of 98.75% in a 10-fold cross-validation, identifying SARS-CoV-2 with an AUC of 98%, a specificity of 0.9939 and a sensitivity of 1.00 in a binary classification. Then, using the same basis, we classify SARS-CoV-2 from 384 complete viral genome sequences with human host that contain the gene ORF1ab, obtained from the NCBI, with a 10-fold accuracy of 98.17%, a specificity of 0.9797 and a sensitivity of 1.00.
Furthermore, an in-depth analysis of the results allows us to identify base-pair sequences that are unique to SARS-CoV-2 and do not appear in other virus strains, which could then be used as a basis for designing new primers and examined by experts to extract further insights. These preliminary results seem encouraging enough to identify deep learning as a promising research avenue for developing assisted detection tests for SARS-CoV-2. To this end, the interaction between viromics and deep learning will hopefully help to solve global infection problems. In addition, we offer our code and processed data to be used for diagnostic purposes by medical doctors, virologists and scientists involved in solving the SARS-CoV-2 pandemic. As more data become available, we will update our system.
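The sensitivity and specificity figures quoted for the binary classification follow the usual confusion-matrix definitions. As a minimal sketch, assuming labels are encoded as 1 for SARS-CoV-2 and 0 for any other strain (an encoding chosen here for illustration, not taken from the authors' code), they could be computed as follows. The function name and the toy labels are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    sensitivity = tp / (tp + fn)  # fraction of true positives that are detected
    specificity = tn / (tn + fp)  # fraction of true negatives that are rejected
    return sensitivity, specificity


# Toy labels (illustrative only, not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

A sensitivity of 1.00, as reported above, means no SARS-CoV-2 sequence in the evaluation was classified as another strain; the specificity then measures how often other strains were mistaken for SARS-CoV-2.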
In this paper, deep learning is coupled with explainable artificial intelligence techniques for the discovery of representative genomic sequences in SARS-CoV-2. A convolutional neural network classifier is first trained on 553 sequences from the National Genomics Data Center repository, separating the genome of different virus strains from the Coronavirus family with 98.73% accuracy. The network’s behavior is then analyzed, to discover sequences used by the model to identify SARS-CoV-2, ultimately uncovering sequences exclusive to it. The discovered sequences are validated on samples from the National Center for Biotechnology Information and Global Initiative on Sharing All Influenza Data repositories, and are proven to be able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. Next, one of the sequences is selected to generate a primer set, and tested against other state-of-the-art primer sets, obtaining competitive results. Finally, the primer is synthesized and tested on patient samples (n = 6 previously tested positive), delivering a sensitivity similar to routine diagnostic methods, and 100% specificity. The proposed methodology has a substantial added value over existing methods, as it is able to both automatically identify promising primer sets for a virus from a limited amount of data, and deliver effective results in a minimal amount of time. Considering the possibility of future pandemics, these characteristics are invaluable to promptly create specific detection methods for diagnostics.
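The idea of a sequence "exclusive" to SARS-CoV-2 can be illustrated with a much simpler technique than the network analysis described above: an exhaustive comparison of fixed-length subsequences (k-mers). The sketch below is a stand-in for illustration, not the authors' method. It keeps the k-mers present in every target genome and absent from every other genome. The function names, the value of `k` and the toy sequences are all hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def exclusive_kmers(target_seqs, other_seqs, k):
    """k-mers found in every target sequence and in none of the others."""
    if not target_seqs:
        raise ValueError("at least one target sequence is required")
    # Start from the k-mers shared by all target genomes.
    shared = set.intersection(*(kmers(s, k) for s in target_seqs))
    # Discard anything that also occurs in a non-target genome.
    for s in other_seqs:
        shared -= kmers(s, k)
    return shared


# Toy sequences (illustrative only, far shorter than real genomes).
targets = ["ACGTTGCA", "TTACGTTG"]
others = ["ACGTAAAA", "GGGTTGCC"]
candidates = exclusive_kmers(targets, others, k=5)
print(sorted(candidates))
```

On real data a brute-force comparison like this grows quickly with genome count and length, and it does not tolerate sequencing noise, which is part of the motivation for a learned approach.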
Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) is a chronic disorder characterized by disabling fatigue. Several studies have sought to identify diagnostic biomarkers, with varying results. Here, we extend this approach by combining mRNA expression and DNA methylation data. We performed recursive ensemble feature selection (REFS) on publicly available mRNA expression data in peripheral blood mononuclear cells (PBMCs) of 93 ME/CFS patients and 25 healthy controls, and found a signature of 23 genes capable of distinguishing cases and controls. REFS substantially outperformed other methods, with an AUC of 0.92. We validated the results on a different platform (AUC of 0.95) and in DNA methylation data obtained from four public studies on ME/CFS (99 patients and 50 controls), identifying 48 gene-associated CpGs that also predicted disease status (AUC of 0.97). Finally, ten of the 23 genes could be interpreted in the context of the derailed immune system of ME/CFS.
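The AUC values reported above summarize how well a classifier's scores rank patients above controls. As a minimal sketch, the AUC can be computed directly from its rank interpretation: the fraction of (case, control) pairs in which the case receives the higher score, with ties counted as one half. This illustrates the metric only, not the REFS procedure itself; the function name and the toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy classifier scores (illustrative only): 1 = case, 0 = control.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of the size mentioned here; a sort-based rank computation would be the usual choice for larger datasets.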
The SARS-CoV-2 variant B.1.1.7 lineage, also known as clade GR in the Global Initiative on Sharing All Influenza Data (GISAID), Nextstrain clade 20B, or Variant Under Investigation in December 2020 (VUI – 202012/01), appears to have an increased transmissibility in comparison to other variants. Thus, to contain and study this variant of the SARS-CoV-2 virus, it is necessary to develop a specific molecular test to uniquely identify it. Using a completely automated pipeline involving deep learning techniques, we designed a primer set which is specific to SARS-CoV-2 variant B.1.1.7 with >99% accuracy, starting from 8,923 sequences from GISAID. The resulting primer set is in the region of the synonymous mutation C16176T in the ORF1ab gene, using the canonical sequence of the variant B.1.1.7 as a reference. Further in-silico testing shows that the primer set’s sequences do not appear in different viruses, using 20,571 virus samples from the National Center for Biotechnology Information (NCBI), nor in other coronaviruses, using 487 samples from the National Genomics Data Center (NGDC). In conclusion, the presented primer set can be exploited as part of a multiplexed approach in the initial diagnosis of Covid-19 patients, or used as a second step of diagnosis in cases already positive for Covid-19, to identify individuals carrying the B.1.1.7 variant.
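An in-silico specificity check of the kind described above can be sketched as a presence test: a primer counts as a hit if its sequence, or its reverse complement, occurs in a genome. The sketch below assumes exact matching and sequences made only of A, C, G and T, so ambiguous bases such as N are not handled. The primer and genomes are made-up toy strings, not the primer set from the study, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def contains_primer(genome, primer):
    """True if the primer or its reverse complement occurs in the genome."""
    genome, primer = genome.upper(), primer.upper()
    return primer in genome or reverse_complement(primer) in genome


def primer_accuracy(primer, target_genomes, other_genomes):
    """Fraction of genomes classified correctly by primer presence alone."""
    correct = sum(contains_primer(g, primer) for g in target_genomes)
    correct += sum(not contains_primer(g, primer) for g in other_genomes)
    total = len(target_genomes) + len(other_genomes)
    if total == 0:
        raise ValueError("no genomes supplied")
    return correct / total


# Toy data (illustrative only).
primer = "ACGGTC"
variant_genomes = ["TTACGGTCAA", "GGGACCGTTT"]  # second holds the reverse complement
other_genomes = ["TTTTAAAACC", "ACGGTAACGG"]
accuracy = primer_accuracy(primer, variant_genomes, other_genomes)
print(accuracy)
```

Real primer design also has to account for melting temperature, mismatch tolerance and amplicon length, none of which this presence test captures.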