BinDash, software for fast genome distance estimation on a typical personal laptop

Zhao, XiaoFei

doi:10.1093/bioinformatics/bty651

Cited by 56 publications

(52 citation statements)

References 3 publications

“…Sets were constructed to have target Jaccard coefficients ranging from 0.00022 to 0.818. Many set-size pairs were evaluated ranging from equal-size sets to sets with sizes differing by a factor of 2^12. In total, we evaluated 36 combinations of set size and J were evaluated, with full results presented in Additional File 2.…”

Section: Sketch Accuracy (mentioning)

confidence: 99%

“…Spurred by MinHash's utility, other groups have proposed alternatives using new ideas from search and data mining. BinDash [12] uses a b-bit one-permutation rolling MinHash to achieve greater accuracy and speed compared to Mash at a smaller memory footprint. Other theoretical improvements are proposed in the HyperMinHash [13] and SuperMinHash [14] studies.…”

Section: Introduction (mentioning)

confidence: 99%

“…We implemented the HLL in the Dashing software tool (https://github.com/dnbaker/dashing), which is free and open source under the GPLv3 license. Dashing supports the functions available in similar tools like Mash [1], BinDash [12] and Sourmash [21]. Dashing can build a sketch of an input sequence set (dashing sketch), including FASTA files (for assembled genomes) or FASTQ files (for sequencing datasets).…”

Section: Introduction (mentioning)

confidence: 99%

Dashing: Fast and Accurate Genomic Distances with HyperLogLog

Baker

Langmead

2018

Preprint

Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections. Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Dashing is open source and available at https://github.com/dnbaker/dashing.

“…For estimating resemblance, Mash uses a 'bottom sketch' strategy as originally proposed by Broder [8]. More efficient techniques for estimating resemblance have since emerged [9,10,11,12], but bottom sketching is elegant in its simplicity. In short, all k-mers from a genome A are passed through a single hash function h but only the smallest m hash values are stored as the sketch S(A), where |S(A)| << |A|.…”

Section: Introduction (mentioning)

confidence: 99%
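
The bottom-sketch idea quoted above can be illustrated with a minimal Python sketch. This is not Mash's or BinDash's implementation: the function names, the choice of BLAKE2b as the hash function h, and the merge-based Jaccard estimator are assumptions made for illustration only.

```python
import hashlib

def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def h(kmer):
    """Map a k-mer to a 64-bit integer. BLAKE2b is an assumed stand-in for h."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def bottom_sketch(seq, k, m):
    """Return the m smallest distinct hash values of seq's k-mers, ascending."""
    return sorted({h(x) for x in kmers(seq, k)})[:m]

def jaccard_estimate(sketch_a, sketch_b, m):
    """Estimate J(A, B): among the m smallest hashes of the merged sketches,
    count the fraction that appears in both sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:m]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for v in merged if v in shared) / len(merged)

if __name__ == "__main__":
    s_a = bottom_sketch("ACGTACGTGG", k=3, m=100)
    s_b = bottom_sketch("ACGTAC", k=3, m=100)
    print(len(s_a), len(s_b), jaccard_estimate(s_a, s_b, m=100))
```

When m exceeds the number of distinct k-mers, as in this toy usage, the sketch holds every hash and the estimate coincides with the exact Jaccard coefficient; the approximation only comes into play once m is much smaller than the k-mer set, which is the |S(A)| << |A| regime the quotation describes.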

Mash Screen: High-throughput sequence containment estimation for genome discovery

Ondov

Starrett

Sappington

et al. 2019

Preprint

The MinHash algorithm has proven effective for rapidly estimating the resemblance of two genomes or metagenomes. However, this method cannot reliably estimate the containment of a genome within a metagenome. Here we describe an online algorithm capable of measuring the containment of genomes and proteomes within either assembled or unassembled sequencing read sets. We describe several use cases, including contamination screening and retrospective analysis of metagenomes for novel genome discovery. Using this tool, we provide containment estimates for every NCBI RefSeq genome within every SRA metagenome, and demonstrate the identification of a novel polyomavirus species from a public metagenome.

“…The many-fold size reductions gained via MinHash open the door to extremely large scale searches. While the initial k-mer MinHash implementation focused on enabling Jaccard similarity comparisons (3), it has since been modified and extended to enable k-mer abundance comparisons (4), decrease runtime and memory requirements (5), and work on streaming input data (6). Furthermore, as Jaccard similarity is impacted by the relative size of the sets being compared, containment searches (2,7,8) have been developed to enable detection of a small set within a larger set, such as a genome within a metagenome.…”

Section: Introduction (mentioning)

confidence: 99%
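
The remark that Jaccard similarity depends on the relative sizes of the two sets, while containment does not, can be checked with exact set arithmetic. The following is a toy illustration on integer sets standing in for k-mer sets; the function names and the set sizes are assumptions and do not come from any of the cited tools.

```python
def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def containment(a, b):
    """Exact containment of A in B: |A ∩ B| / |A|."""
    return len(a & b) / len(a) if a else 0.0

if __name__ == "__main__":
    # Hypothetical sizes: a small "genome" fully present in a large "metagenome".
    genome = set(range(1_000))
    metagenome = set(range(100_000))
    print("Jaccard:    ", jaccard(genome, metagenome))
    print("Containment:", containment(genome, metagenome))
```

Even though every element of the small set is present in the large one, the Jaccard value is bounded by the size ratio of the two sets, whereas containment of the small set in the large one is 1.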

Large-scale sequence comparisons with sourmash

Pierce

Irber

Reiter

et al. 2019

Preprint

The sourmash software package uses MinHash-based sketching to create "signatures", compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at http://github.com/dib-lab/sourmash.

Keywords: bioinformatics, sequence analysis, MinHash, k-mer, sourmash

BinDash, software for fast genome distance estimation on a typical personal laptop

Abstract: Supplementary data are available at Bioinformatics online.
