2023
DOI: 10.1093/nargab/lqad004

BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis

Abstract: Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds causes either (i) increasing the use of the costly sequence alignment or (ii) limited sensitivi…
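The exact-match limitation described in the abstract can be illustrated with a minimal sketch. This is not BLEND's mechanism or code; it only shows conventional seed hashing, assuming fixed-length k-mers as seeds and a truncated BLAKE2b digest as a stand-in hash function. The names `seed_hash`, `build_index`, and `find_exact_seed_matches` are hypothetical.

```python
import hashlib
from collections import defaultdict


def seed_hash(seed: str) -> int:
    """Conventional hash of a seed: any change to the seed changes the value."""
    # Assumption: a 64-bit BLAKE2b digest stands in for a typical seed hash.
    digest = hashlib.blake2b(seed.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_index(reference: str, k: int) -> dict:
    """Map the hash value of every k-mer in the reference to its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[seed_hash(reference[pos:pos + k])].append(pos)
    return index


def find_exact_seed_matches(reference: str, query: str, k: int) -> list:
    """Return (query_pos, reference_pos) pairs found by one lookup per seed."""
    index = build_index(reference, k)
    matches = []
    for qpos in range(len(query) - k + 1):
        value = seed_hash(query[qpos:qpos + k])
        for rpos in index.get(value, []):
            matches.append((qpos, rpos))
    return matches


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    # An identical seed is found with a single hash lookup.
    print(find_exact_seed_matches(reference, "CGTT", 4))
    # A seed that differs by one base hashes to a different value and is missed.
    print(find_exact_seed_matches(reference, "CGAT", 4))
```

In this sketch the seed `CGAT` differs from the reference seed `CGTT` by a single base, yet the lookup finds nothing, which is the behavior the abstract attributes to conventional hashing.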

Citations: cited by 21 publications (15 citation statements)
References: 114 publications
“…For the runtime, we evaluated randstrobes parametrized as (n = 2, l = 20, w_min = 21, w_max = 100) and (n = 2, l = 20, w_min = 21, w_max = 1000) since the window size affects runtime. Strobemers with n > 3 show no substantial gain in the context of sequence matching at the cost of additional runtime [12] (although they have been modified and used for specific scenarios [8]). Also, the relative performance can be extrapolated from the n = 2 and n = 3 cases, since the construction is recursive, therefore, we omit them in this study.…”
Section: Results (mentioning)
confidence: 99%
“…Three different methods to link the k -mers (minstrobes, randstrobes, and hybridstrobes) were described in, [18] where the most effective seed was randstrobes. While there are applications that use other strobemer types [8], randstrobes have been most frequently used, e.g., for short-read mapping [20], transcriptomic long-read normalization [15], and read classification [23] in bioinformatic applications. Our recent proof-of-concept study also shows that randstrobes can provide accurate sequence similarity ranking through estimating the Jaccard distance [12].…”
Section: Introduction (mentioning)
confidence: 99%
“…Second, since RawHash generates hash values for matching similar regions, it provides opportunities to use the hash-based seeding techniques [14, 41–70] that are optimized for identifying sequence similarities accurately without requiring large memory space, such as minimizers [14, 71], spaced seeds [45], syncmers [67], strobemers [68, 69], and fuzzy seed matching as in BLEND [70]. Although we do not evaluate in this work, we implement the minimizer seeding technique in RawHash.…”
Section: Discussion (mentioning)
confidence: 99%
“…The methods to select strobes differ (Sahlin 2021a). For example, Minstrobes have been used for long-read overlap detection (Firtina et al 2023) and alternating strobe lengths have also been explored (Maier and Sahlin 2023). However, randstrobes were shown to be more sensitive for sequence matching than other methods using fixed strobe lengths (minstrobes and hybridstrobes) (Sahlin 2021a), and simpler to construct than alternating strobe lengths (altstrobes and multistrobes) (Maier and Sahlin 2023), and is so far most commonly implemented in practice (Sahlin 2022, Nip et al 2023, Xu et al 2023).…”
Section: Methods (mentioning)
confidence: 99%