2022
DOI: 10.1101/2022.11.23.517691
Preprint

BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis

Abstract: Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds, as conventional hashing methods assign distinct hash values to different seeds, including highly similar ones. Finding only exact-matching seeds causes either 1) increased use of costly sequence alignment or 2) limited sensitivity.…
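To make the limitation concrete, here is a minimal sketch of the conventional exact seed hashing the abstract contrasts against — not BLEND's fuzzy mechanism. The 2-bit base encoding and the `lookup` helper are illustrative assumptions, not the paper's actual hash function:

```python
# Conventional exact seed hashing: each distinct seed gets a distinct hash
# value, so a single lookup can only find exact-matching seeds.
def seed_hash(seed):
    # Toy deterministic hash: 2 bits per base (an assumption for illustration).
    enc = {"A": 0, "C": 1, "G": 2, "T": 3}
    h = 0
    for base in seed:
        h = (h << 2) | enc[base]
    return h

# Index reference seeds by their hash values.
index = {seed_hash(s): s for s in ("ACGTACGT", "TTGGCCAA")}

def lookup(seed):
    # One hash lookup; returns None unless the seed matches exactly.
    return index.get(seed_hash(seed))
```

An exact seed is found in a single lookup, but a seed differing by even one base hashes to a different value and misses the index entirely — the sensitivity gap that motivates fuzzy seed matching.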


Cited by 7 publications (7 citation statements) | References 134 publications
“…overlapping seeds, calculates their double-strand (w, k)-minimizers, and finds exact matches of minimizers in the reference genome by querying the seed index. Other seeding approaches, such as syncmers 42,100 , strobemers 101 , and BLEND 54 , can be used instead of the minimizer approach. However, to maintain correctness and high sensitivity, the same algorithm used to extract seeds from the patterned genome sequence must be used in the compressed seeding to calculate seeds from the patterned read sequence.…”
Section: Compressed Seeding (mentioning)
confidence: 99%
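The minimizer-based seeding described above can be sketched as follows. This is a simplified single-strand sketch with a toy hash; real mappers use an invertible integer hash, handle both strands, and stream the window minimum rather than rescanning it:

```python
# Sketch of (w,k)-minimizer seeding: in every window of w consecutive k-mers,
# keep the k-mer with the smallest hash value as a seed.
def kmer_hash(kmer):
    # Deterministic toy hash, 2 bits per base (an illustrative assumption).
    enc = {"A": 0, "C": 1, "G": 2, "T": 3}
    h = 0
    for base in kmer:
        h = (h << 2) | enc[base]
    return h

def minimizers(seq, w, k):
    """Return the set of (position, k-mer) minimizers of seq."""
    kmers = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        picked.add(min(window, key=lambda pk: kmer_hash(pk[1])))
    return picked

def build_index(ref, w, k):
    # Seed index: hash value -> list of reference positions.
    index = {}
    for pos, km in minimizers(ref, w, k):
        index.setdefault(kmer_hash(km), []).append(pos)
    return index

def query(index, read, w, k):
    # Exact minimizer matches as (read position, reference position) pairs.
    return [(rpos, gpos)
            for rpos, km in minimizers(read, w, k)
            for gpos in index.get(kmer_hash(km), [])]
```

Because every window of an exact substring also appears in the reference, each read minimizer is guaranteed to hit a reference minimizer at the corresponding offset — which is why replacing the seeding scheme on only one side, as the statement warns, breaks this guarantee.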
“…Many attempts were made to facilitate searching large genomic data and finding similar genomic sequences (Supplementary Note 1). Recent attempts tend to follow one of three key directions: (1) building smaller data indexes for faster index access and traversal by extracting a smaller number of seeds from genomic sequences 37–42, (2) reducing indexing and seeding overhead by avoiding the use of computationally-expensive seeds 25,27,35,43–45, and (3) alleviating the accuracy degradation that results from considering only exactly matching seeds between two sequences by using sparse seeds (e.g., spaced seeds) or variable-length seeds 46–60. To our knowledge, most state-of-the-art computational methods suffer from four critical limitations.…”
Section: Main (mentioning)
confidence: 99%
“…The length of the reads depends on the sequencing technology and significantly affects the performance and accuracy of genome analysis. The use of long reads can provide higher accuracy and performance in many genome analysis steps 8–13.…”
Section: Introduction (mentioning)
confidence: 99%
“…Unfortunately, this approach is computationally very expensive and does not scale to large genomic studies that include a large number of individuals, for three key reasons. First, mapping even a single read set is computationally expensive [12, 13] (e.g., 75 hours for aligning 300,000,000 short reads, which provide 30× coverage of the human genome) as it heavily relies on a computationally costly alignment algorithm [14, 15, 16]. Second, the number of available read sets doubles approximately every 8 months [17, 18], and the rate of growth will continue to increase as sequencing technologies continue to become more cost-effective and sequence with higher throughput [19].…”
Section: Introduction (mentioning)
confidence: 99%