Indel-tolerant read mapping with trinucleotide frequencies using cache-oblivious kd-trees

Mahmud, Pavel; Wiedenhoeft, John; Schliep, Alexander

doi:10.1093/bioinformatics/bts380

Cited by 5 publications

(3 citation statements)

References 55 publications

(68 reference statements)

“…These often focus on rare k-mers, e.g. for species identification [11]—significantly similar in spirit to oligonucleotide probes in DNA-microarrays [12, 13]—or compute differences using [14] or Jaccard-distances [15].…”

Section: Introduction (mentioning)

confidence: 99%

Fast parallel construction of variable-length Markov chains

et al. 2021

Background: Alignment-free methods are a popular approach for comparing biological sequences, including complete genomes. The methods range from probability distributions of sequence composition to first and higher-order Markov chains, where a k-th order Markov chain over DNA has $$4^k$$ formal parameters. To circumvent this exponential growth in parameters, variable-length Markov chains (VLMCs) have gained popularity for applications in molecular biology and other areas. VLMCs adapt the depth depending on sequence context and thus curtail excesses in the number of parameters. The scarcity of available fast, or even parallel software tools, prompted the development of a parallel implementation using lazy suffix trees and a hash-based alternative. Results: An extensive evaluation was performed on genomes ranging from 12 Mbp to 22 Gbp. Relevant learning parameters were chosen guided by the Bayesian Information Criterion (BIC) to avoid over-fitting. Our implementation greatly improves upon the state-of-the-art even in serial execution. It exhibits very good parallel scaling with speed-ups for long sequences close to the optimum indicated by Amdahl’s law of 3 for 4 threads and about 6 for 16 threads, respectively. Conclusions: Our parallel implementation released as open-source under the GPLv3 license provides a practically useful alternative to the state-of-the-art which allows the construction of VLMCs even for very large genomes significantly faster than previously possible. Additionally, our parameter selection based on BIC gives guidance to end-users comparing genomes.
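
As a rough consistency check on the speed-ups quoted in this abstract, one can assume the standard form of Amdahl's law with a parallelizable fraction $$p$$ and $$n$$ threads. The symbols $$S$$, $$p$$ and $$n$$, and the value of $$p$$ inferred here, are assumptions for illustration and are not stated in the abstract.

$$S(n) = \frac{1}{(1-p) + p/n}$$

$$S(4) = 3 \;\Rightarrow\; 1 - \tfrac{3p}{4} = \tfrac{1}{3} \;\Rightarrow\; p = \tfrac{8}{9}$$

$$S(16) = \frac{1}{\tfrac{1}{9} + \tfrac{8}{9 \cdot 16}} = \frac{1}{\tfrac{1}{9} + \tfrac{1}{18}} = 6$$

Under that assumption, a parallel fraction of about 8/9 is consistent with both quoted limits, 3 for 4 threads and 6 for 16 threads.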

“…In regions with genomic variation (e.g. those regions in which the investigator is usually most interested), maintaining good performance generally leads to lower sensitivity (Gontarz et al 2013; Mahmud et al 2012). In addition, the Burrows-Wheeler transform method is less flexible than hash based methods.…”

Section: Introduction (mentioning)

confidence: 99%

MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping

et al. 2014

MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs) as well as generating outputs tailored to aid in SV discovery. All variant discovery benefits from an accurate description of the read placement confidence. To this end, MOSAIK uses a neural-network based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. In order to ensure that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our command and pipeline launcher system GKNO (http://gkno.me).

“…The amount of data produced by current high-throughput DNA sequencing machines such as Illumina HiSeq 2500, which can generate as much as 100Gb a day, demands enormous computational power for primary analysis tasks such as read mapping. Although a large body of literature is concerned with read mapping [21,20,19,33,14,11,26,12,42,27,1], most approaches map one read at a time. The order of mapping is arbitrary regardless of similarities between reads which might hint towards the same mapping location.…”

Section: Introduction (mentioning)

confidence: 99%

TreQ-CG: Clustering Accelerates High-Throughput Sequencing Read Mapping

Mahmud, Schliep 2014

Preprint

As high-throughput sequencers become standard equipment outside of sequencing centers, there is an increasing need for efficient methods for pre-processing and primary analysis. While a vast literature proposes methods for HTS data analysis, we argue that significant improvements can still be gained by exploiting expensive pre-processing steps which can be amortized with savings from later stages. We propose a method to accelerate and improve read mapping based on an initial clustering of possibly billions of high-throughput sequencing reads, yielding clusters of high stringency and a high degree of overlap. This clustering improves on the state-of-the-art in running time for small datasets and, for the first time, makes clustering high-coverage human libraries feasible. Given the efficiently computed clusters, only one representative read from each cluster needs to be mapped using a traditional readmapper such as BWA, instead of individually mapping all reads. On human reads, all processing steps, including clustering and mapping, only require 11%-59% of the time for individually mapping all reads, achieving speed-ups for all readmappers, while minimally affecting mapping quality. This accelerates a highly sensitive readmapper such as Stampy to be competitive with a fast readmapper such as BWA on unclustered reads.
