2017
DOI: 10.1093/bioinformatics/btw832

ntCard: a streaming algorithm for cardinality estimation in genomics data

Abstract: Motivation: Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers, or even better, to build a histogram of k-mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure …
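To make the two quantities in the abstract concrete (the number of distinct k-mers and the histogram of k-mer frequencies), here is a minimal sketch that computes both exactly on a toy input. It is a naive in-memory counter, not ntCard's streaming estimator, and the function names `kmer_counts` and `frequency_histogram` are hypothetical.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Exact multiplicity of every k-mer across all reads (naive, memory-heavy)."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k over the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def frequency_histogram(counts):
    """Map multiplicity f to the number of distinct k-mers seen exactly f times."""
    return Counter(counts.values())

# Toy input, chosen only for illustration.
reads = ["ACGTAC", "CGTAA"]
counts = kmer_counts(reads, 3)
hist = frequency_histogram(counts)
distinct = len(counts)  # number of distinct 3-mers
```

Exact counting like this needs memory proportional to the number of distinct k-mers, which is the cost a streaming estimator is meant to avoid.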

Cited by 68 publications (67 citation statements)
References 30 publications
“…The white boxes are FASTA files and the grey boxes represent the tools that process or generate them. Ntcard [24] is used to select the best-suited k-mer size. A compacted DBG is then constructed using Bcalm2 [25].…”
Section: Dbg-based Reads Correction (mentioning)
confidence: 99%
“…Genomes were downloaded from ENSEMBL (Yates et al, 2016). The program ntCard (Mohamadi, Khan, & Birol, 2016) was used to estimate the number of distinct kmers (subsequences of length k) for each set of contaminant genomes and estimate the number of elements to be inserted into each Bloom filter (Table S1). All Bloom filters were created with a target false-positive rate (FPR) of 2%.…”
Section: Environmental Contaminant Screening (mentioning)
confidence: 99%
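The statement above sizes each Bloom filter from an estimated element count and a target false-positive rate. The citing paper does not say which formula it used; the sketch below assumes the textbook relations m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2. The helper names and the element count of one million are illustrative assumptions, not values from Table S1.

```python
import math

def bloom_parameters(n_items, target_fpr):
    """Textbook Bloom filter sizing: returns (bits, hash function count)."""
    m_bits = math.ceil(-n_items * math.log(target_fpr) / (math.log(2) ** 2))
    # The optimal hash count is real-valued; round to the nearest integer, at least 1.
    k_hashes = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k_hashes

def expected_fpr(n_items, m_bits, k_hashes):
    """Approximate false-positive rate after inserting n_items elements."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# Hypothetical input: a distinct k-mer estimate of 1,000,000 and a 2% target FPR.
n_distinct = 1_000_000
m_bits, k_hashes = bloom_parameters(n_distinct, 0.02)
achieved = expected_fpr(n_distinct, m_bits, k_hashes)
```

Because the hash count must be an integer, the achieved rate can land slightly off the target, so it is worth recomputing it after rounding.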
“…We first run ntHits (v0.0.1; https://github.com/bcgsc/nthits; Supplemental Methods) to remove error kmers from high throughput sequencing data, and build a canonical representation of coverage-thresholded kmers [8] using a Bloom filter, while maintaining a low false positive rate (≈0.0005). The Bloom filter is then read by ntEdit (v1.1.0 with matching kmer length k), and contigs from a supplied assembly are processed in turn (Fig.…”
Section: Methods (mentioning)
confidence: 99%
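The idea of a canonical, coverage-thresholded k-mer set in the statement above can be illustrated on a toy input. This sketch assumes the common convention that the canonical form of a k-mer is the lexicographically smaller of the k-mer and its reverse complement, and it uses an exact Python set where the cited pipeline uses a Bloom filter. It is not ntHits's implementation; `canonical`, `solid_kmers`, and `min_count` are hypothetical names.

```python
from collections import Counter

# Translation table for complementing DNA bases (uppercase A, C, G, T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def solid_kmers(reads, k, min_count):
    """Canonical k-mers whose multiplicity across all reads is at least min_count."""
    counts = Counter(
        canonical(read[i:i + k])
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}

# Toy reads: "CGA" occurs once and is dropped as a likely error k-mer.
kept = solid_kmers(["ACGT", "ACGA"], 3, min_count=2)
```

Counting canonical forms merges the evidence from both strands, so a k-mer and its reverse complement contribute to the same coverage total.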