The COVID-19 pandemic has sparked an urgent need to uncover the underlying biology of this devastating disease. Though RNA viruses mutate more rapidly than DNA viruses, relatively few single nucleotide polymorphisms (SNPs) differentiate the main SARS-CoV-2 lineages that have spread throughout the world. In this study, we investigated 129 RNA-seq data sets and 6928 consensus genomes to contrast the intra-host and inter-host diversity of SARS-CoV-2. Our analyses yielded three major observations. First, the mutational profile of SARS-CoV-2 shows that intra-host single nucleotide variants (iSNVs) and SNPs are similar, albeit with differences in C > U changes. Second, iSNV and SNP patterns in SARS-CoV-2 are more similar to those of MERS-CoV than to those of SARS-CoV-1. Third, a significant fraction of insertions and deletions contribute to the genetic diversity of SARS-CoV-2. Altogether, our findings provide insight into SARS-CoV-2 genomic diversity, inform the design of detection tests, and highlight the potential of iSNVs for tracking the transmission of SARS-CoV-2.
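To illustrate what a mutational profile tallies, the following minimal sketch counts substitution classes such as C > U between a reference and a sample sequence. It is not the study's pipeline: the function name `substitution_spectrum` and the toy sequences are hypothetical, and the sketch assumes the two sequences are already aligned to the same length.

```python
from collections import Counter


def substitution_spectrum(reference: str, sample: str) -> Counter:
    """Count substitution classes (e.g. 'C>U') between two aligned RNA sequences.

    Hypothetical helper for illustration only; assumes pre-aligned input.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    spectrum = Counter()
    for ref_base, alt_base in zip(reference.upper(), sample.upper()):
        # Skip matches, gaps, and ambiguous bases; count only A/C/G/U substitutions.
        if ref_base != alt_base and ref_base in "ACGU" and alt_base in "ACGU":
            spectrum[f"{ref_base}>{alt_base}"] += 1
    return spectrum


# Toy, made-up sequences (not SARS-CoV-2 data).
reference = "ACGUCCGAUC"
sample = "AUGUCUGAUC"
spectrum = substitution_spectrum(reference, sample)
print(spectrum)
```

Running the same tally separately over iSNV calls and over consensus-level SNPs would give two spectra that can be compared class by class, which is the kind of contrast the abstract describes.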
The wide prevalence and regulated expression of long noncoding RNAs (lncRNAs) highlight their functional roles, but, with few exceptions, the molecular basis for their activities and their structure-function relationships remain to be investigated. Among the relatively few lncRNAs conserved over significant evolutionary distances is the long intergenic noncoding RNA (lincRNA) Cyrano (orthologous to human OIP5-AS1). Cyrano contains a region of 300 nucleotides that is highly conserved within tetrapods and that, in turn, contains a functional stretch of 26 nt of deep conservation. This region binds to and facilitates the degradation of the microRNA miR-7, a short ncRNA with multiple cellular functions, including modulation of oncogenic expression. We probed the secondary structure of Cyrano in vitro and in cells using chemical and enzymatic probing, and validated the results using comparative sequence analysis. At the center of the functional core of Cyrano is a cloverleaf structure maintained over the >400 million years of divergent evolution that separate fish and primates. This strikingly conserved motif provides interaction sites for several RNA-binding proteins and masks a conserved recognition site for miR-7. Conservation in this region strongly suggests that the function of Cyrano depends on the formation of this RNA structure, which could modulate the rate and efficiency of degradation of miR-7.
The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need, we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show that our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, and is available for download at www.gitlab.com/treangenlab/seqscreen.
Modern benchtop DNA synthesis techniques and increased concern about emerging pathogens have elevated the importance of screening oligonucleotides for pathogens of concern. However, accurate and sensitive characterization of oligonucleotides is an open challenge for many of the current techniques and ontology-based tools. To address this gap, we have developed a novel software tool, SeqScreen, that can accurately and sensitively characterize short DNA sequences using a set of curated Functions of Sequences of Concern (FunSoCs), novel functional labels specific to microbial pathogenesis that describe the pathogenic potential of individual proteins. We show that our ensemble machine learning model, after training on these curations, can label sequences with FunSoCs via an imbalanced multi-class and multi-label classification task with high accuracy. In summary, SeqScreen represents a first step towards a novel paradigm of functionally informed pathogen characterization from genomic and metagenomic datasets. SeqScreen is open-source and freely available for download at: https://www.gitlab.com/treangenlab/seqscreen.