2021
DOI: 10.1093/nargab/lqab004

Interpretable detection of novel human viruses from genome sequencing data

Abstract: Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizin…

Cited by 38 publications (73 citation statements)
References 71 publications
“…Throughout this paper, we will use the term subread in a special sense: the first k nucleotides of a given sequencing read (in other words, a prefix of a read). The original DeePaC and DeePaC-vir datasets [23, 24] consist of 250bp simulated Illumina reads. The training, validation and held-out test sets contain mixtures of reads originating from different viruses or bacterial species with confirmed labels [32, 33], explicitly modeling generalization to “novel” (i.e.…”
Section: Methods (mentioning)
confidence: 99%
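
The "subread" defined in the statement above is simply a prefix of a read. A minimal sketch of that definition follows; the function name `subread` and the example read are illustrative assumptions, not taken from the cited paper.

```python
def subread(read: str, k: int) -> str:
    """Return the first k nucleotides of a read (a prefix of the read).

    Hypothetical helper for illustration; if k exceeds the read length,
    the whole read is returned.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return read[:k]


# Example: prefixes of increasing length taken from one made-up read.
example_read = "ACGTACGTAC"
prefixes = [subread(example_read, k) for k in (2, 5, 10)]
```

Each entry of `prefixes` starts with the same nucleotides as the full read, which is what distinguishes a subread in this sense from an arbitrary substring.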
“…We investigated two architectures shown previously to perform well in the pathogenicity or host-range prediction task – a reverse-complement CNN consisting of 2 convolutional layers and 2 fully-connected layers and a reverse-complement bidirectional LSTM. For more design details and the description of the reverse-complement variants of convolutional and LSTM layers, we refer the reader to [23, 24]. Those architectures guarantee identical predictions for sequences in their forward and reverse-complement orientations in a single forward pass.…”
Section: Methods (mentioning)
confidence: 99%
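
The reverse-complement guarantee quoted above can be illustrated with a toy example. This is a sketch under stated assumptions, not the layer design of the cited architectures: it scans a single filter together with its reverse-complemented copy and max-pools over positions and orientations, which yields the same feature for a sequence and its reverse complement (up to floating-point rounding). All function names here are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"  # with this channel order, complementing is a channel flip


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    x[np.arange(len(seq)), [ALPHABET.index(b) for b in seq]] = 1.0
    return x


def reverse_complement(x: np.ndarray) -> np.ndarray:
    """Reverse positions and swap A<->T, C<->G by flipping both axes."""
    return x[::-1, ::-1]


def conv_valid(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Slide filter w (k, 4) over x without padding; one score per position."""
    k = w.shape[0]
    return np.array([np.sum(x[i:i + k] * w) for i in range(x.shape[0] - k + 1)])


def rc_invariant_feature(x: np.ndarray, w: np.ndarray) -> float:
    """Max score over all positions for the filter and its reverse complement."""
    forward_scores = conv_valid(x, w)
    reverse_scores = conv_valid(x, reverse_complement(w))
    return float(max(forward_scores.max(), reverse_scores.max()))


rng = np.random.default_rng(0)
toy_filter = rng.normal(size=(3, 4))
toy_seq = one_hot("ACGTTGCAAC")
same = np.isclose(
    rc_invariant_feature(toy_seq, toy_filter),
    rc_invariant_feature(reverse_complement(toy_seq), toy_filter),
)
```

Scanning the reverse-complemented sequence with the original filter gives the same scores, in reversed order, as scanning the original sequence with the reverse-complemented filter, so pooling over both orientations removes the dependence on strand.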
“…These methods are capable of decomposing signal in high-dimensional genomic information (a limitation of regression frameworks) without the need for sequence alignment. Genomic machine learning analyses have demonstrated the ability to not only classify viruses from recurring viral genome motifs [28], but also classify their broad host origins [29][30][31][32]. Specifically considering coronaviruses, support vector machines and random forests have been trained on various genomic features to predict host group, including nucleotide and dinucleotide biases [33], amino acid composition [34] or sequence k-mers [35].…”
Section: Introduction (mentioning)
confidence: 99%
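
Dinucleotide biases and sequence k-mers, two of the feature types named in the statement above, can both be computed by counting overlapping windows. The sketch below is a generic illustration; the function name and the normalization to relative frequencies are assumptions, not the feature definitions used in the cited studies.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int = 2) -> dict:
    """Relative frequency of every A/C/G/T k-mer in seq (overlapping windows).

    Windows containing other symbols (for example N) are ignored.
    With k=2 this gives dinucleotide frequencies.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# A fixed-length feature vector (16 values for k=2) from a made-up sequence.
features = kmer_frequencies("ACGTACGTTGCA", k=2)
```

Because the output always has 4^k entries in a fixed order, vectors like `features` could be passed directly to a classifier such as a support vector machine or random forest.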