BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis

Fırtına, Can; Park, Jisung; Alser, Mohammed; Kim, Jeremie S.; Cali, Damla Senol; Shahroodi, Taha; Ghiasi, Nika Mansouri; Singh, Gagandeep; Kanellopoulos, Konstantinos; Alkan, Can; Mutlu, Onur

doi:10.1093/nargab/lqad004

Cited by 21 publications

(15 citation statements)

References 114 publications


“…For the runtime, we evaluated randstrobes parametrized as (n = 2, l = 20, w_min = 21, w_max = 100) and (n = 2, l = 20, w_min = 21, w_max = 1000) since the window size affects runtime. Strobemers with n > 3 show no substantial gain in the context of sequence matching at the cost of additional runtime [12] (although they have been modified and used for specific scenarios [8]). Also, the relative performance can be extrapolated from the n = 2 and n = 3 cases, since the construction is recursive, therefore, we omit them in this study.…”

Section: Results (mentioning)

confidence: 99%

“…Three different methods to link the k-mers (minstrobes, randstrobes, and hybridstrobes) were described in [18], where the most effective seed was randstrobes. While there are applications that use other strobemer types [8], randstrobes have been most frequently used, e.g., for short-read mapping [20], transcriptomic long-read normalization [15], and read classification [23] in bioinformatic applications. Our recent proof-of-concept study also shows that randstrobes can provide accurate sequence similarity ranking through estimating the Jaccard distance [12].…”

Section: Introduction (mentioning)

confidence: 99%


Designing efficient randstrobes for sequence similarity analyses

Karami, Mohammadi, Martin et al. 2023. Preprint.


Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis, reducing the search space by providing anchors between queries and references. However, k-mers are limited to exact matches between sequences. This has led to alternative constructs, such as spaced k-mers, that can match across substitutions. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in (Sahlin, 2021), have been incorporated into several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Randstrobes are constructed by linking together k-mers in a pseudo-random fashion and depend on a hash function, a link function, and a comparator for their construction. Recently, we showed that the more random this linking appears (measured in entropy), the more efficient the seeds are for sequence similarity analysis. The level of pseudo-randomness will depend on the hashing, linking, and comparison operators. However, no study has investigated the efficacy of the underlying operators to produce randstrobes. In this study, we propose several new construction methods. One of our proposed methods is based on a Binary Search Tree (BST), which lowers the time complexity and practical runtime relative to other methods for some parametrizations. To our knowledge, we are also the first to describe and study the types of biases that occur during construction. We designed three metrics to measure the bias. Using these new evaluation metrics, we uncovered biases and limitations in previous methods and showed that our proposed methods have favorable speed and sampling uniformity compared to previously proposed methods. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. Also, we suggest combining the two versions to improve accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.
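The abstract above names three ingredients of randstrobe construction: a hash function, a link function, and a comparator. The following is a minimal sketch of how they might fit together for order-2 randstrobes, not the authors' implementation. The specific choices here (a BLAKE2b k-mer hash, XOR as the link function, minimum as the comparator, and a window measured as an offset from the first strobe's start) are assumptions made for illustration, as are all function and parameter names.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Hash function (assumed choice): stable 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def link(h1: int, h2: int) -> int:
    """Link function (assumed choice): XOR of the two strobe hashes."""
    return h1 ^ h2


def randstrobes(seq: str, l: int, w_min: int, w_max: int):
    """Yield (pos1, pos2) start positions of order-2 randstrobes.

    For each first strobe at position i, the second strobe is picked from
    positions i + w_min .. i + w_max (clipped to the sequence end).
    """
    hashes = [kmer_hash(seq[i:i + l]) for i in range(len(seq) - l + 1)]
    n = len(hashes)
    for i in range(n):
        lo = i + w_min
        hi = min(i + w_max, n - 1)
        if lo > hi:
            break  # no room left for a second strobe
        # Comparator (assumed choice): keep the candidate with the
        # smallest link value.
        j = min(range(lo, hi + 1), key=lambda p: link(hashes[i], hashes[p]))
        yield i, j
```

This naive version scans the whole window for every first strobe, so its cost grows with the window size. The abstract reports that a Binary Search Tree based method lowers time complexity for some parametrizations; that method is not reproduced here.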



“…Second, since RawHash generates hash values for matching similar regions, it provides opportunities to use the hash-based seeding techniques [14, 41–70] that are optimized for identifying sequence similarities accurately without requiring large memory space, such as minimizers [14, 71], spaced seeds [45], syncmers [67], strobemers [68, 69], and fuzzy seed matching as in BLEND [70]. Although we do not evaluate in this work, we implement the minimizer seeding technique in RawHash.…”

Section: Discussion (mentioning)

confidence: 99%

RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes

Fırtına, Ghiasi, Lindegger et al. 2023. Preprint. Self-citation.


Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either 1) require powerful computational resources that may not be available for portable sequencers or 2) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: 1) read mapping, 2) relative abundance estimation, and 3) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better average throughput and 2) an average speedup of 32.1x and 2.1x in the mapping time, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.
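The central idea in the abstract above, that slightly different signals for the same DNA content should quantize to the same value and therefore hash to the same value, can be illustrated with a small sketch. This is not RawHash's actual code: the uniform bucketing, the value range, the bucket count, and the function names `quantize` and `seed_hash` are all assumptions chosen for illustration.

```python
import hashlib


def quantize(value: float, lo: float = -3.0, hi: float = 3.0,
             n_buckets: int = 16) -> int:
    """Map a normalized signal value to one of n_buckets uniform buckets.

    Values outside [lo, hi] are clipped. The range and bucket count are
    illustrative assumptions.
    """
    clipped = min(max(value, lo), hi)
    bucket = int((clipped - lo) / (hi - lo) * n_buckets)
    return min(bucket, n_buckets - 1)  # value == hi falls in the last bucket


def seed_hash(events, n_buckets: int = 16) -> int:
    """Hash a run of consecutive signal values after quantization.

    Requires n_buckets <= 256 so that each bucket index fits in one byte.
    """
    packed = bytes(quantize(e, n_buckets=n_buckets) for e in events)
    digest = hashlib.blake2b(packed, digest_size=8).digest()
    return int.from_bytes(digest, "big")


# Two noisy readings of the same hypothetical signal segment.
reading_a = [0.10, -1.00, 1.60, 0.50]
reading_b = [0.12, -0.98, 1.62, 0.52]
same = seed_hash(reading_a) == seed_hash(reading_b)
```

In this sketch, `same` is true because each pair of values lands in the same bucket. Two values that straddle a bucket boundary would still produce different hashes, so the choice of quantization scheme matters in practice.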


“…The methods to select strobes differ (Sahlin 2021a). For example, minstrobes have been used for long-read overlap detection (Firtina et al 2023) and alternating strobe lengths have also been explored (Maier and Sahlin 2023). However, randstrobes were shown to be more sensitive for sequence matching than other methods using fixed strobe lengths (minstrobes and hybridstrobes) (Sahlin 2021a), and simpler to construct than alternating strobe lengths (altstrobes and multistrobes) (Maier and Sahlin 2023), and are so far most commonly implemented in practice (Sahlin 2022, Nip et al 2023, Xu et al 2023).…”

Section: Methods (mentioning)

confidence: 99%

Designing efficient randstrobes for sequence similarity analyses

Karami, Soltani Mohammadi, Martin et al. 2024. Bioinformatics.


Motivation: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences leading to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080–94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy.

Results: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign’s accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.

Availability and implementation: All methods and evaluation benchmarks are available in a public Github repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.

