Turtle: Identifying frequent k-mers with cache-efficient algorithms

Roy, Raj; Bhattacharya, Debashish; Schliep, Alexander

doi:10.1093/bioinformatics/btu132

Cited by 61 publications

(37 citation statements)

References 21 publications


“…Numerous software packages can organize the raw sequencing data of each individual into comprehensive k-mer lists [28, 31–34], which can be later used for fast retrieval of k-mer counts. However, the compilation of full-genome lists is somewhat inefficient if the lists are only used once and then immediately deleted.…”

Section: Discussion (mentioning)

confidence: 99%

FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads

Pajuste, Kaplinski, Möls et al. 2017, Sci Rep


We have developed a computational method that counts the frequencies of unique k-mers in FASTQ-formatted genome data and uses this information to infer the genotypes of known variants. FastGT can detect the variants in a 30x genome in less than 1 hour using ordinary low-cost server hardware. The overall concordance with the genotypes of two Illumina “Platinum” genomes is 99.96%, and the concordance with the genotypes of the Illumina HumanOmniExpress is 99.82%. Our method provides k-mer database that can be used for the simultaneous genotyping of approximately 30 million single nucleotide variants (SNVs), including >23,000 SNVs from Y chromosome. The source code of FastGT software is available at GitHub (https://github.com/bioinfo-ut/GenomeTester4/).


“…Both software were run for 4 values of k simultaneously, k=31; 47; 63; 79 using four threads for parallelism. As a comparison running a fast k-mer-counter, scTurtle (Roy et al, 2014) took 3594s using eight cores and 26 G of memory, for a single value of k = 31 for B. Impatiens.…”

Section: Comparison To Kmergenie (mentioning)

confidence: 99%

“…Much work has been done on reducing memory requirements, based on exact or approximately correct methods of keeping track of a large set of k-mers, this work includes using succinct set representations (Conway and Bromage, 2011) or probabilistic encodings such as Bloom filters (Chikhi and Rizk, 2012;Melsted and Pritchard, 2011;Pell et al, 2012), whereas recent advances have focused on more speed (Deorowicz et al, 2013;Roy et al, 2014). Although the impact on memory usage is considerable, compared to previous approaches, these methods require storing all k-mers, explicitly or implicitly, in memory.…”

Section: Introduction (mentioning)

confidence: 99%

KmerStream: streaming algorithms for k-mer abundance estimation

Melsted, Halldórsson 2014, Bioinformatics


Motivation: Several applications in bioinformatics, such as genome assemblers and error corrections methods, rely on counting and keeping track of k-mers (substrings of length k). Histograms of k-mer frequencies can give valuable insight into the underlying distribution and indicate the error rate and genome size sampled in the sequencing experiment. Results: We present KmerStream, a streaming algorithm for estimating the number of distinct k-mers present in high-throughput sequencing data. The algorithm runs in time linear in the size of the input and the space requirement are logarithmic in the size of the input. We derive a simple model that allows us to estimate the error rate of the sequencing experiment, as well as the genome size, using only the aggregate statistics reported by KmerStream. As an application we show how KmerStream can be used to compute the error rate of a DNA sequencing experiment. We run KmerStream on a set of 2656 whole genome sequenced individuals and compare the error rate to quality values reported by the sequencing equipment. We discover that while the quality values alone are largely reliable as a predictor of error rate, there is considerable variability in the error rates between sequencing runs, even when accounting for reported quality values.


“…Other tools like DSK [7] and KMC [8] exploit a two-disk architecture and aim at reducing expensive IO operations. Turtle [9] replaces a standard Bloom filter by a cache-efficient counterpart. MSPKmerCounter [10] introduces the concept of minimizers to the k-mer counting, thus further optimizing the disk-based approach.…”

Section: Introduction (mentioning)

confidence: 99%
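
The statement above says only that Turtle swaps a standard Bloom filter for a cache-efficient one and gives no details. As a rough illustration of what "cache-efficient" can mean, the sketch below assumes a blocked layout, in which every probe bit for an item falls inside one 512-bit block (one 64-byte cache line), so a lookup touches a single line instead of several scattered ones. The class `BlockedBloomFilter`, the helper `frequent_kmers`, and all parameter values are hypothetical names chosen for this example; this is not Turtle's implementation.

```python
import hashlib


class BlockedBloomFilter:
    """Bloom filter whose probe bits for one item all lie in a single block."""

    def __init__(self, num_blocks=1024, block_bits=512, num_hashes=4):
        self.num_blocks = num_blocks
        self.block_bits = block_bits    # 512 bits models one 64-byte cache line
        self.num_hashes = num_hashes
        self.blocks = [0] * num_blocks  # each block is stored as an int bitmask

    def _block_and_mask(self, item):
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # odd stride
        block = h1 % self.num_blocks
        mask = 0
        for i in range(self.num_hashes):
            # Double hashing, confined to the chosen block.
            mask |= 1 << (((h1 >> 32) + i * h2) % self.block_bits)
        return block, mask

    def add(self, item):
        """Insert item; return True if it was possibly present already."""
        block, mask = self._block_and_mask(item)
        seen = (self.blocks[block] & mask) == mask
        self.blocks[block] |= mask
        return seen

    def __contains__(self, item):
        block, mask = self._block_and_mask(item)
        return (self.blocks[block] & mask) == mask


def frequent_kmers(reads, k, bloom):
    """Count k-mers seen at least twice; singletons stay in the filter only.

    Bloom filters give false positives, so a k-mer seen once may
    occasionally be reported with a count of 2.
    """
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
            elif bloom.add(kmer):
                counts[kmer] = 2  # second sighting: promote to the exact table
    return counts
```

Keeping first sightings in the filter and promoting only repeated k-mers to the exact table means the many singleton k-mers never occupy a table entry, at the cost of occasional false positives.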

Gerbil: a fast and memory-efficient k-mer counter with GPU-support

Erbert, Rechner, Müller‐Hannemann 2017, Algorithms Mol Biol


Background: A basic task in bioinformatics is the counting of k-mers in genome sequences. Existing k-mer counting tools are most often optimized for small k < 32 and suffer from excessive memory resource consumption or degrading performance for large k. However, given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important. Results: We present the open source k-mer counting software Gerbil that has been designed for the efficient counting of k-mers for k ≥ 32. Our software is the result of an intensive process of algorithm engineering. It implements a two-step approach. In the first step, genome reads are loaded from disk and redistributed to temporary files. In a second step, the k-mers of each temporary file are counted via a hash table approach. In addition to its basic functionality, Gerbil can optionally use GPUs to accelerate the counting step. In a set of experiments with real-world genome data sets, we show that Gerbil is able to efficiently support both small and large k. Conclusions: While Gerbil’s performance is comparable to existing state-of-the-art open source k-mer counting tools for small k < 32, it vastly outperforms its competitors for large k, thereby enabling new applications which require large values of k. Electronic supplementary material: The online version of this article (doi:10.1186/s13015-017-0097-9) contains supplementary material, which is available to authorized users.
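
The two-step approach described in this abstract (redistribute, then count each partition with a hash table) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Gerbil's code: in-memory lists stand in for the temporary files, k-mers are assigned to buckets by a CRC32 checksum, and the function names `partition_kmers` and `count_kmers_two_step` are hypothetical.

```python
import zlib
from collections import Counter


def partition_kmers(reads, k, num_buckets):
    """Step 1: distribute every k-mer into one of num_buckets partitions.

    Lists stand in for temporary files here (an assumption of this sketch).
    Identical k-mers always land in the same bucket.
    """
    buckets = [[] for _ in range(num_buckets)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            buckets[zlib.crc32(kmer.encode()) % num_buckets].append(kmer)
    return buckets


def count_kmers_two_step(reads, k, num_buckets=4):
    """Step 2: count each partition independently with a hash table."""
    total = {}
    for bucket in partition_kmers(reads, k, num_buckets):
        # Only one bucket's table is built at a time, which bounds peak memory.
        total.update(Counter(bucket))
    return total


print(count_kmers_two_step(["ACGTACGT"], k=4))
```

Because each distinct k-mer belongs to exactly one bucket, the per-bucket tables never overlap and can be merged by simple union.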


Turtle: Identifying frequent k-mers with cache-efficient algorithms

Abstract: The tools are freely available for download at http://bioinformatics.rutgers.edu/Software/Turtle and http://figshare.com/articles/Turtle/791582.
