[bioRxiv preprint] One of the reasons for the fast spread of SARS-CoV-2 is the lack of accuracy in detection tools in the clinical field. Molecular techniques, such as quantitative real-time RT-PCR and nucleic acid sequencing methods, are widely used to identify pathogens. For this particular virus, however, they have an overall unsatisfactory detection rate, due to its relatively recent emergence and still incompletely understood features. In addition, SARS-CoV-2 is remarkably similar to other coronaviruses, and it can present together with other respiratory infections, making identification even harder. To tackle this issue, we propose an assisted detection test, combining molecular testing with deep learning. The proposed approach employs a state-of-the-art deep convolutional neural network, able to automatically create features starting from the genome sequence of the virus. Experiments on data from the Novel Coronavirus Resource (2019nCoVR) show that the proposed approach is able to correctly classify SARS-CoV-2, distinguishing it from other coronavirus strains, such as MERS-CoV, HCoV-NL63, HCoV-OC43, HCoV-229E, HCoV-HKU1, and SARS-CoV, regardless of missing information and errors in sequencing (noise). From a dataset of 553 non-repeated complete genome sequences that vary from 1,260 to 31,029 bp in length, the proposed approach classifies the different coronaviruses with an average accuracy of 98.75% in a 10-fold cross-validation, identifying SARS-CoV-2 with an AUC of 98%, specificity of 0.9939 and sensitivity of 1.00 in a binary classification. Then, using the same basis, we classify SARS-CoV-2 from 384 complete viral genome sequences with human host from the NCBI that contain the gene ORF1ab, with a 10-fold accuracy of 98.17%, a specificity of 0.9797 and sensitivity of 1.00.
Furthermore, an in-depth analysis of the results allows us to identify base-pair sequences that are unique to SARS-CoV-2 and do not appear in other virus strains, which could then be used as a basis for designing new primers and examined by experts to extract further insights. These preliminary results seem encouraging enough to identify deep learning as a promising research avenue to develop assisted detection tests for SARS-CoV-2. To this end, the interaction between viromics and deep learning will hopefully help to solve global infection problems. In addition, we offer our code and processed data to be used for diagnostic purposes by medical doctors, virologists and scientists involved in solving the SARS-CoV-2 pandemic. As more data become available we will update our system.
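The sensitivity and specificity figures quoted in this abstract are ratios computed from the counts of a binary confusion matrix. A minimal sketch of that arithmetic, assuming labels encoded as 1 (SARS-CoV-2) and 0 (other coronavirus); the function name and the toy labels are illustrative assumptions and are not taken from the authors' released code:

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels 1 = positive, 0 = negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    sensitivity = tp / (tp + fn)  # fraction of positives that were detected
    specificity = tn / (tn + fp)  # fraction of negatives that were rejected
    return sensitivity, specificity


# Toy labels (hypothetical): two positives, three negatives, one false alarm.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

A sensitivity of 1.00, as reported above, means no SARS-CoV-2 sample in the evaluation folds was missed; the specificity below 1 reflects a small number of other coronaviruses labeled as SARS-CoV-2.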
In this paper, deep learning is coupled with explainable artificial intelligence techniques for the discovery of representative genomic sequences in SARS-CoV-2. A convolutional neural network classifier is first trained on 553 sequences from the National Genomics Data Center repository, separating the genome of different virus strains from the Coronavirus family with 98.73% accuracy. The network’s behavior is then analyzed, to discover sequences used by the model to identify SARS-CoV-2, ultimately uncovering sequences exclusive to it. The discovered sequences are validated on samples from the National Center for Biotechnology Information and Global Initiative on Sharing All Influenza Data repositories, and are proven to be able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. Next, one of the sequences is selected to generate a primer set, and tested against other state-of-the-art primer sets, obtaining competitive results. Finally, the primer is synthesized and tested on patient samples (n = 6 previously tested positive), delivering a sensitivity similar to routine diagnostic methods, and 100% specificity. The proposed methodology has a substantial added value over existing methods, as it is able to both automatically identify promising primer sets for a virus from a limited amount of data, and deliver effective results in a minimal amount of time. Considering the possibility of future pandemics, these characteristics are invaluable to promptly create specific detection methods for diagnostics.
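The idea of a sequence "exclusive" to SARS-CoV-2 can be illustrated with a brute-force stand-in. The paper derives its candidates by analyzing the trained convolutional network; the sketch below does not do that. It simply enumerates substrings of a fixed length k that occur in every target genome and in no other genome. The function names, the toy sequences, and the choice of k are all assumptions made for illustration:

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def exclusive_kmers(target_seqs, other_seqs, k):
    """k-mers present in every target sequence and absent from all others."""
    if not target_seqs:
        return set()
    # Start from k-mers shared by all target genomes.
    shared = set.intersection(*(kmers(s, k) for s in target_seqs))
    # Discard anything that also shows up in a non-target genome.
    for s in other_seqs:
        shared -= kmers(s, k)
    return shared


# Toy data (hypothetical, far shorter than real genomes).
targets = ["ACGTTGCA", "TTACGTTG"]
others = ["GGACGTTA"]
candidates = exclusive_kmers(targets, others, k=5)
```

On real genomes of roughly 30,000 bases this exhaustive search is still feasible for a single k, but it gives no ranking of candidates, which is one reason a learned model is attractive for selecting sequences worth turning into primers.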
The SARS-CoV-2 variant B.1.1.7 lineage, also known as clade GR from Global Initiative on Sharing All Influenza Data (GISAID), Nextstrain clade 20B, or Variant Under Investigation in December 2020 (VUI – 202012/01), appears to have an increased transmissibility in comparison to other variants. Thus, to contain and study this variant of the SARS-CoV-2 virus, it is necessary to develop a specific molecular test to uniquely identify it. Using a completely automated pipeline involving deep learning techniques, we designed a primer set which is specific to SARS-CoV-2 variant B.1.1.7 with >99% accuracy, starting from 8,923 sequences from GISAID. The resulting primer set is in the region of the synonymous mutation C16176T in the ORF1ab gene, using the canonical sequence of the variant B.1.1.7 as a reference. Further in-silico testing shows that the primer set's sequences do not appear in different viruses, using 20,571 virus samples from the National Center for Biotechnology Information (NCBI), nor in other coronaviruses, using 487 samples from National Genomics Data Center (NGDC). In conclusion, the presented primer set can be exploited as part of a multiplexed approach in the initial diagnosis of Covid-19 patients, or used as a second step of diagnosis in cases already positive for Covid-19, to identify individuals carrying the B.1.1.7 variant.
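The in-silico specificity check described here amounts to confirming that a primer sequence is absent from a collection of non-target genomes. A minimal sketch under simplifying assumptions: exact string matching only (no mismatches tolerated), both strands checked via the reverse complement, and genomes held in memory as a dictionary. The function names and the toy sequences are hypothetical and do not come from the paper:

```python
# Translation table for the four unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def primer_hits(primer, genomes):
    """Names of genomes containing the primer on either strand (exact match)."""
    rc = reverse_complement(primer)
    return [name for name, genome in genomes.items()
            if primer in genome or rc in genome]


# Toy genomes (hypothetical); a specific primer should hit none of the non-targets.
genomes = {
    "a": "TTAACGTT",   # contains the primer on the forward strand
    "b": "GGCGTTGG",   # contains only the reverse complement
    "c": "GGGGGGGG",   # contains neither
}
hits = primer_hits("AACG", genomes)
```

A practical check would also allow a small number of mismatches and handle ambiguous bases such as N, which this sketch deliberately leaves out.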
As the COVID-19 pandemic persists, new SARS-CoV-2 variants with potentially dangerous features have been identified by the scientific community. Variant B.1.1.7 lineage clade GR from Global Initiative on Sharing All Influenza Data (GISAID) was first detected in the UK, and it appears to possess an increased transmissibility. At the same time, South African authorities reported variant B.1.351, which shares several mutations with B.1.1.7 and might also present high transmissibility. Even more recently, a variant labeled P.1 with 17 non-synonymous mutations was detected in Brazil. In such a situation, it is paramount to rapidly develop specific molecular tests to uniquely identify, contain, and study new variants. Using a completely automated pipeline built around deep learning techniques, we design primer sets specific to variants B.1.1.7, B.1.351, and P.1, respectively. Starting from sequences openly available in the GISAID repository, our pipeline was able to deliver the primer sets in just under 16 hours for each case study. In-silico tests show that the sequences in the primer sets present high accuracy and do not appear in samples from different viruses, nor in other coronaviruses or SARS-CoV-2 variants. The presented methodology can be exploited to swiftly obtain primer sets for each independent new variant, which can later be part of a multiplexed approach for the initial diagnosis of COVID-19 patients. Furthermore, since our approach delivers primers able to differentiate between variants, it can be used as a second step of a diagnosis in cases already positive for COVID-19, to identify individuals carrying variants with potentially threatening features.