2019
DOI: 10.1016/j.amc.2019.02.018

Improving MinHash via the containment index with applications to metagenomic analysis

Cited by 32 publications (71 citation statements)
References 15 publications
“…We use KMC [18] to generate k-mers for the reads and each reference genome, and then, we utilize an implementation of the theoretical concept of containment min hash [14] (called CMash) to estimate the percent of k-mers in each reference genome that are also present in the reads (the "Methods" section). Intuitively, this gives us an estimate of how likely it is for each reference genome to be present in the sample.…”
Section: Methods Overview (mentioning)
confidence: 99%
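The quantity described in this statement, the fraction of a reference genome's k-mers that also occur in the reads, can be computed exactly for small inputs. The following is a minimal sketch, not the KMC or CMash implementation: the function names `kmers` and `containment`, the toy sequences, and the choice k = 4 are all assumptions made for illustration, and reverse complements are ignored.

```python
from __future__ import annotations


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq (hypothetical helper)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(reference: set[str], sample: set[str]) -> float:
    """Fraction of reference k-mers that also appear in the sample."""
    if not reference:
        return 0.0
    return len(reference & sample) / len(reference)


# Toy data, invented for illustration only.
reference_kmers = kmers("ACGTACGTGA", 4)

read_kmers: set[str] = set()
for read in ["ACGTAC", "TTTTGA"]:
    read_kmers |= kmers(read, 4)

score = containment(reference_kmers, read_kmers)
print(score)  # 3 of the 6 distinct reference 4-mers occur in the reads
```

The exact version needs both k-mer sets in memory. The cited paper's contribution is estimating this same ratio from a small sketch instead.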
“…Alignment-based profiling is regarded as highly accurate, but aligning millions of reads against a reference database of tens to hundreds of gigabytes (GB) in size is computationally infeasible. Metalign minimizes computational cost with a high-speed, high-recall pre-filtering method based on the mathematical concept of containment min hash [14], which identifies a small number of candidate organisms that are potentially in the sample and creates a subset database consisting of these organisms. This pre-filtering approach reduced our comprehensive NCBI-based database of 243 GB, often by more than 100-fold, with some variance depending on the diversity of the sample.…”
Section: Introduction (mentioning)
confidence: 99%
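The pre-filtering idea in this statement amounts to keeping only the reference organisms whose estimated containment in the sample is high enough, then aligning against that subset. A rough sketch follows, assuming the containment estimates are already available; the function name `prefilter`, the genome names, the scores, and the 0.10 cutoff are hypothetical and are not taken from Metalign.

```python
from __future__ import annotations


def prefilter(containment_by_genome: dict[str, float], cutoff: float) -> list[str]:
    """Return genomes whose containment estimate meets the cutoff, best first."""
    kept = [name for name, c in containment_by_genome.items() if c >= cutoff]
    return sorted(kept, key=lambda name: containment_by_genome[name], reverse=True)


# Invented scores: estimated fraction of each genome's k-mers seen in the reads.
scores = {"genome_A": 0.92, "genome_B": 0.03, "genome_C": 0.41}

candidates = prefilter(scores, cutoff=0.10)
print(candidates)  # genome_B falls below the cutoff and is dropped
```

A lower cutoff favors recall at the cost of a larger subset database, which matches the statement's description of the step as high-recall.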
“…First, KMC [17] is used to enumerate the k-mers in the reads, with the k-mers of the reference genomes having been pre-computed by KMC, and then intersect these sets. We then utilize the containment MinHash similarity metric (presented theoretically in [14] via an implementation by one of the coauthors (Koslicki) called CMash to efficiently estimate the similarity/containment index) between each reference genome and the input sample. The containment index is closely related to the Jaccard index, and, in this case, refers to the percent of k-mers in a reference genome that are also present in the reads.…”
Section: Database Pre-filtering With CMash (mentioning)
confidence: 99%
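The relationship between the two indices mentioned in this statement can be written out from their standard set definitions. In the sketch below, the symbols are assumptions introduced here: $A$ is the k-mer set of a reference genome and $B$ is the k-mer set of the reads.

```latex
\[
C(A,B) = \frac{|A \cap B|}{|A|},
\qquad
J(A,B) = \frac{|A \cap B|}{|A \cup B|}
\]
% Substituting |A \cap B| = C(A,B)\,|A| and
% |A \cup B| = |A| + |B| - |A \cap B| gives:
\[
J(A,B) = \frac{C(A,B)\,|A|}{|A| + |B| - C(A,B)\,|A|}
\]
```

When $B$ is much larger than $A$, as with a metagenomic sample compared against a single genome, $J(A,B)$ is close to zero even if every k-mer of $A$ is present, while $C(A,B)$ still reports the fraction of $A$ that was found.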
“…However, Blast is not designed for high-throughput metagenomic read classification, and it is computationally expensive to get local alignments for hundreds of thousands to millions of reads. Other alignment-based techniques can be mapping-based – using methods like the Burrows-Wheeler transform (BWT) or variants of hash tables [11, 34]. While enabling fast queries, mapping tools need to spend “training time” to compress/prefilter the reference database.…”
Section: Related Work (mentioning)
confidence: 99%