2022
DOI: 10.1101/2022.11.23.517691
Preprint

BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis

Abstract: Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds, as conventional hashing methods assign distinct hash values to different seeds, including highly similar ones. Finding only exact-matching seeds causes either 1) increased use of costly sequence alignment or 2) limited sensitivity.…
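To make the limitation concrete, here is a minimal sketch of the conventional exact seed hashing the abstract contrasts against — not BLEND's fuzzy mechanism. The 2-bit base encoding and the `lookup` helper are illustrative assumptions, not the paper's actual hash function:

```python
# Conventional exact seed hashing: each distinct seed gets a distinct hash
# value, so a single lookup can only find exact-matching seeds.
def seed_hash(seed):
    # Toy deterministic hash: 2 bits per base (an assumption for illustration).
    enc = {"A": 0, "C": 1, "G": 2, "T": 3}
    h = 0
    for base in seed:
        h = (h << 2) | enc[base]
    return h

# Index reference seeds by their hash values.
index = {seed_hash(s): s for s in ("ACGTACGT", "TTGGCCAA")}

def lookup(seed):
    # One hash lookup; returns None unless the seed matches exactly.
    return index.get(seed_hash(seed))
```

An exact seed is found in a single lookup, but a seed differing by even one base hashes to a different value and misses the index entirely — the sensitivity gap that motivates fuzzy seed matching.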


Cited by 7 publications (7 citation statements) | References 134 publications
“…overlapping seeds, calculates their double-strand (w, k)-minimizers, and finds exact matches of minimizers in the reference genome by querying the seed index. Other seeding approaches, such as syncmers 42,100 , strobemers 101 , and BLEND 54 , can be used instead of the minimizer approach. However, to maintain correctness and high sensitivity, the same algorithm used to extract seeds from the patterned genome sequence must be used in the compressed seeding to calculate seeds from the patterned read sequence.…”
Section: Compressed Seeding (mentioning)
confidence: 99%
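The minimizer-based seeding described above can be sketched as follows. This is a simplified single-strand sketch with a toy hash; real mappers use an invertible integer hash, handle both strands, and stream the window minimum rather than rescanning it:

```python
# Sketch of (w,k)-minimizer seeding: in every window of w consecutive k-mers,
# keep the k-mer with the smallest hash value as a seed.
def kmer_hash(kmer):
    # Deterministic toy hash, 2 bits per base (an illustrative assumption).
    enc = {"A": 0, "C": 1, "G": 2, "T": 3}
    h = 0
    for base in kmer:
        h = (h << 2) | enc[base]
    return h

def minimizers(seq, w, k):
    """Return the set of (position, k-mer) minimizers of seq."""
    kmers = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        picked.add(min(window, key=lambda pk: kmer_hash(pk[1])))
    return picked

def build_index(ref, w, k):
    # Seed index: hash value -> list of reference positions.
    index = {}
    for pos, km in minimizers(ref, w, k):
        index.setdefault(kmer_hash(km), []).append(pos)
    return index

def query(index, read, w, k):
    # Exact minimizer matches as (read position, reference position) pairs.
    return [(rpos, gpos)
            for rpos, km in minimizers(read, w, k)
            for gpos in index.get(kmer_hash(km), [])]
```

Because every window of an exact substring also appears in the reference, each read minimizer is guaranteed to hit a reference minimizer at the corresponding offset — which is why replacing the seeding scheme on only one side, as the statement warns, breaks this guarantee.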
“…Many attempts were made to facilitate searching large genomic data and finding similar genomic sequences (Supplementary Note 1). Recent attempts tend to follow one of three key directions: (1) building smaller data indexes for faster index access and traversal by extracting a smaller number of seeds from genomic sequences 37–42, (2) reducing indexing and seeding overhead by avoiding the use of computationally-expensive seeds 25,27,35,43–45, and (3) alleviating the accuracy degradation that results from considering only exactly matching seeds between two sequences by using sparse seeds (e.g., spaced seeds) or variable-length seeds 46–60. To our knowledge, most state-of-the-art computational methods suffer from four critical limitations.…”
Section: Main (mentioning)
confidence: 99%
“…The length of the reads depends on the sequencing technology and significantly affects the performance and accuracy of genome analysis. The use of long reads can provide higher accuracy and performance in many genome analysis steps 8–13.…”
Section: Introduction (mentioning)
confidence: 99%
“…Unfortunately, this approach is computationally very expensive and does not scale to large genomic studies that include a large number of individuals, for three key reasons. First, mapping even a single read set is computationally expensive [12, 13] (e.g., 75 hours for aligning 300,000,000 short reads, which provide 30× coverage of the human genome) as it heavily relies on a computationally costly alignment algorithm [14, 15, 16]. Second, the number of available read sets doubles approximately every 8 months [17, 18], and the rate of growth will continue to increase as sequencing technologies continue to become more cost-effective and sequence with higher throughput [19].…”
Section: Introduction (mentioning)
confidence: 99%