ntCard: a streaming algorithm for cardinality estimation in genomics data

Mohamadi, Hamid; Khan, Hira; Birol, İnanç

doi:10.1093/bioinformatics/btw832

Cited by 68 publications

(67 citation statements)

References 30 publications


“…The white boxes are FASTA files and the grey boxes represent the tools that process or generate them. Ntcard [24] is used to select the best-suited k-mer size. A compacted DBG is then constructed using Bcalm2 [25].…”

Section: Dbg-based Reads Correction (mentioning)

confidence: 99%

Toward perfect reads: short reads correction via mapping on compacted de Bruijn graphs

Limasset

Flot

Peterlongo

2019

Preprint


Motivations: Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large data sets or consider reads as mere suites of k-mers, without taking into account their full-length read information. Results: We propose a new method to correct short reads using de Bruijn graphs, and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance, then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond. Availability and Implementation: The implementation is open source and available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package.
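The k-mer abundance filtering step described in the abstract above can be illustrated with a toy example. The following is only a sketch of the general idea (count canonical k-mers exactly, then keep those seen at least a threshold number of times); it is not Bcool's implementation, and the function names, the threshold, and the example reads are all hypothetical. The exact distinct count it produces is the quantity that a streaming estimator such as ntCard approximates without storing every k-mer.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k: int) -> Counter:
    """Exactly count canonical k-mers over all reads (toy, in-memory)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def solid_kmers(counts: Counter, min_abundance: int) -> set:
    """Keep k-mers whose abundance reaches the (hypothetical) threshold."""
    return {kmer for kmer, n in counts.items() if n >= min_abundance}


if __name__ == "__main__":
    # Made-up reads: the third differs from the others in its last base.
    reads = ["ACGTA", "ACGTA", "ACGTT"]
    counts = count_kmers(reads, k=4)
    print("distinct k-mers:", len(counts))
    print("solid k-mers:", sorted(solid_kmers(counts, min_abundance=2)))
```

Low-abundance k-mers are more likely to come from sequencing errors, which is why discarding them cleans the graph; the right threshold depends on coverage and is an assumption here.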


“…Genomes were downloaded from ENSEMBL (Yates et al, 2016). The program ntCard (Mohamadi, Khan, & Birol, 2016) was used to estimate the number of distinct kmers (subsequences of length k) for each set of contaminant genomes and estimate the number of elements to be inserted into each Bloom filter (Table S1). All Bloom filters were created with a target false-positive rate (FPR) of 2%.…”

Section: Environmental Contaminant Screening (mentioning)

confidence: 99%

A novel approach to wildlife transcriptomics provides evidence of disease‐mediated differential expression and changes to the microbiome of amphibian populations

et al. 2018

Self Cite


Ranaviruses are responsible for a lethal, emerging infectious disease in amphibians and threaten their populations throughout the world. Despite this, little is known about how amphibian populations respond to ranaviral infection. In the United Kingdom, ranaviruses impact the common frog (Rana temporaria). Extensive public engagement in the study of ranaviruses in the UK has led to the formation of a unique system of field sites containing frog populations of known ranaviral disease history. Within this unique natural field system, we used RNA sequencing (RNA-Seq) to compare the gene expression profiles of R. temporaria populations with a history of ranaviral disease and those without. We have applied an RNA read-filtering protocol that incorporates Bloom filters, previously used in clinical settings, to limit the potential for contamination that comes with the use of RNA-Seq in nonlaboratory systems. We have identified a suite of 407 transcripts that are differentially expressed between populations of different ranaviral disease history. This suite contains genes with functions related to immunity, development, protein transport and olfactory reception among others. A large proportion of potential noncoding RNA transcripts present in our differentially expressed set provide first evidence of a possible role for long noncoding RNA (lncRNA) in amphibian response to viruses. Our read-filtering approach also removed significantly more bacterial reads from libraries generated from positive disease history populations. Subsequent analysis revealed these bacterial read sets to represent distinct communities of bacterial species, which is suggestive of an interaction between ranavirus and the host microbiome in the wild.
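The citation statement for this paper sizes each Bloom filter from an ntCard estimate of the number of distinct k-mers and a 2% target false-positive rate. As a rough sketch of that arithmetic, the block below uses the textbook Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The cited study does not say which formulas or tool settings it used, so this is an assumption; the function names and the example item count are hypothetical and not taken from Table S1.

```python
import math


def bloom_filter_size(n_items: int, target_fpr: float) -> tuple:
    """Return (bits, hash_count) for a standard Bloom filter.

    n_items    -- expected number of distinct elements (e.g. a k-mer
                  cardinality estimate)
    target_fpr -- desired false-positive rate, strictly between 0 and 1
    """
    if n_items <= 0 or not 0.0 < target_fpr < 1.0:
        raise ValueError("n_items must be positive and 0 < target_fpr < 1")
    # Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-n_items * math.log(target_fpr) / (math.log(2) ** 2))
    # Optimal number of hash functions: k = (m / n) * ln 2, rounded to an integer
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


def expected_fpr(n_items: int, bits: int, hashes: int) -> float:
    """Approximate false-positive rate: (1 - e^(-k n / m))^k."""
    return (1.0 - math.exp(-hashes * n_items / bits)) ** hashes


if __name__ == "__main__":
    # Hypothetical cardinality estimate, for illustration only.
    n_distinct_kmers = 1_000_000
    bits, hashes = bloom_filter_size(n_distinct_kmers, target_fpr=0.02)
    print(f"bits={bits}, hashes={hashes}, "
          f"expected FPR={expected_fpr(n_distinct_kmers, bits, hashes):.4f}")
```

Because the hash count must be an integer, the achieved rate can differ slightly from the target. Overestimating the number of distinct k-mers wastes memory, while underestimating it raises the false-positive rate, which is why a cardinality estimate is useful before building the filter.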


“…We first run ntHits (v0.0.1; https://github.com/bcgsc/nthits; Supplemental Methods) to remove error kmers from high throughput sequencing data, and build a canonical representation of coverage-thresholded kmers [8] using a Bloom filter, while maintaining a low false positive rate (≈0.0005). The Bloom filter is then read by ntEdit (v1.1.0 with matching kmer length k), and contigs from a supplied assembly are processed in turn ( Fig.…”

Section: Methods (mentioning)

confidence: 99%

ntEdit: scalable genome assembly polishing

Rm

Coombe

Mohamadi

et al. 2019

Preprint

Self Cite


In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled E. coli and C. elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20X), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14s and <3m, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20X coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30-40m on those sequences. We show how ntEdit ran in <2h20m to improve upon long and linked read human genome assemblies of NA12878, using high coverage (54X) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo-haploid assemblies of the 20 Gbp interior and white spruce genomes in <4 and <5h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. Availability: https://github.com/bcgsc/ntedit

