2020
DOI: 10.1089/cmb.2019.0298

Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing

Abstract: Alignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed up the indexing of k-mers through hash-table and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity due to sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They allow to improve the sensitivity and t…

Cited by 10 publications (7 citation statements)
References 22 publications
“…CityHash is a general-purpose hash function, which we adapted to spaced seed hashing by replacing the ‘do not care’ positions in each substring with an asterisk. Iterative Spaced Seed Hashing (ISSH) (Petrucci et al., 2020) is a spaced seed hash function that reuses previous hashes based on the seed’s overlapping patterns. Because CityHash and ISSH lack canonical hashing, we also fed the reverse-complement of the input sequences to compare run times with ntHash2.…”
Section: Results (mentioning)
confidence: 99%
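
The adaptation described in this statement can be sketched as follows. This is only an illustration under stated assumptions: the seed is written as a string of `1` (care) and `0` (do not care), `hashlib.blake2b` stands in for CityHash (which is not in the Python standard library), and the helper names `mask_kmer`, `reverse_complement`, and `spaced_hashes` are hypothetical. It is not the citing authors' benchmark code.

```python
import hashlib

# Assumption: input contains only the unambiguous bases A, C, G, T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def mask_kmer(kmer, seed):
    """Replace every 'do not care' position (seed character '0') with '*'."""
    if len(kmer) != len(seed):
        raise ValueError("k-mer and seed must have the same length")
    return "".join(c if s == "1" else "*" for c, s in zip(kmer, seed))


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def spaced_hashes(seq, seed):
    """Hash every window of seq after masking it with the seed.

    blake2b is a stand-in for a general-purpose hash such as CityHash.
    """
    k = len(seed)
    out = []
    for i in range(len(seq) - k + 1):
        masked = mask_kmer(seq[i:i + k], seed)
        digest = hashlib.blake2b(masked.encode(), digest_size=8).digest()
        out.append(int.from_bytes(digest, "big"))
    return out


def both_strand_hashes(seq, seed):
    """Without canonical hashing, hash the forward and reverse-complement strands separately."""
    return spaced_hashes(seq, seed), spaced_hashes(reverse_complement(seq), seed)
```

Because the hash here is not canonical, covering both strands takes two passes, which is the extra work the statement accounts for when comparing run times with ntHash2.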
“…When filtering spaced k-mers or searching their occurrences, we currently convert each k-mer to a spaced k-mer separately which takes O(q) time where q is the weight of the spaced seed. We could speed this up by using the technique of Petrucci et al [36] which uses the previous overlapping spaced k-mers to construct the next spaced k-mer. However, currently the consensus step of LOMEX is the most time-consuming step so overall this improvement would only have a minor effect.…”
Section: Discussion (mentioning)
confidence: 99%
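
The contrast drawn in this statement, O(q) work per position versus reuse of an overlapping previous spaced k-mer, can be illustrated with a simplified sketch. The assumptions are: a 2-bit encoding of A, C, G, T; a seed given as a `1`/`0` string; and reuse of only the immediately preceding hash (a shift of one position). ISSH as described in the statements above draws on several previous hashes, so this is a reduced illustration of the idea, not the published algorithm. All function names are hypothetical.

```python
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumption: 2 bits per base, no N


def care_positions(seed):
    """Offsets of the '1' (care) positions in the seed."""
    return [i for i, c in enumerate(seed) if c == "1"]


def naive_hash(seq, i, positions):
    """O(q) construction: read every care position of the window at i."""
    h = 0
    for rank, offset in enumerate(positions):
        h |= ENC[seq[i + offset]] << (2 * rank)
    return h


def hashes_with_reuse(seq, seed):
    """Build each hash from the previous one where the seed overlaps itself at shift 1."""
    positions = care_positions(seed)
    reuse_mask = 0  # ranks whose base already sits one rank higher in the previous hash
    fresh = []      # (rank, offset) pairs that must be read from the sequence
    for rank, offset in enumerate(positions):
        if rank + 1 < len(positions) and positions[rank + 1] == offset + 1:
            reuse_mask |= 3 << (2 * rank)
        else:
            fresh.append((rank, offset))

    out = []
    for i in range(len(seq) - len(seed) + 1):
        if i == 0:
            h = naive_hash(seq, 0, positions)
        else:
            # Shift the previous hash down by one base and keep the reusable ranks.
            h = (out[-1] >> 2) & reuse_mask
            for rank, offset in fresh:
                h |= ENC[seq[i + offset]] << (2 * rank)
        out.append(h)
    return out
```

With the seed `1101011`, two of the five care positions are recovered from the previous hash by one shift and one mask, so only three bases are read per window instead of five.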
“…(Lin et al, 2008) and SHRiMP2 (David et al, 2011) use spaced seeds to improve the sensitivity when mapping short reads (i.e., Illumina paired-end reads). Although spaced seeds enable finding fuzzy seed matches to improve the sensitivity, generating the hash values of spaced seeds is computationally costly (Petrucci et al, 2020), and spaced seeds cannot find any arbitrary fuzzy seed pairs as these seeds mask a fixed number of characters at certain positions. There have been recent improvements in determining the positions and the number of masks on a seed to improve the sensitivity of spaced seeds (Petrucci et al, 2020;Mallik and Ilie, 2021).…”
Section: Introduction (mentioning)
confidence: 99%
“…Although spaced seeds enable finding fuzzy seed matches to improve the sensitivity, generating the hash values of spaced seeds is computationally costly (Petrucci et al, 2020), and spaced seeds cannot find any arbitrary fuzzy seed pairs as these seeds mask a fixed number of characters at certain positions. There have been recent improvements in determining the positions and the number of masks on a seed to improve the sensitivity of spaced seeds (Petrucci et al, 2020;Mallik and Ilie, 2021). However, none of these works can enable finding arbitrary fuzzy seed matches as they only find patterns for masking positions that still require exact matches at certain positions, which is a key limitation in improving the sensitivity with spaced seeds.…”
Section: Introduction (mentioning)
confidence: 99%
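
The limitation named in these two statements, that a spaced seed tolerates differences only at its fixed masked positions, can be shown with a small sketch. The seed notation (`1` for care, `0` for masked) and the function name `spaced_match` are assumptions made for illustration.

```python
def spaced_match(a, b, seed):
    """True if a and b agree at every care ('1') position of the seed."""
    if not (len(a) == len(b) == len(seed)):
        raise ValueError("sequences and seed must have the same length")
    return all(x == y for x, y, s in zip(a, b, seed) if s == "1")


seed = "1101011"           # weight 5: five positions must match exactly
reference = "ACGTACG"
masked_diff = "ACTTACG"    # differs at offset 2, which the seed masks
care_diff = "ACGAACG"      # differs at offset 3, which the seed checks

print(spaced_match(reference, masked_diff, seed))  # mismatch falls on a masked position
print(spaced_match(reference, care_diff, seed))    # mismatch falls on a care position
```

The first comparison succeeds and the second fails, although both pairs differ by a single base: where the mismatch falls relative to the seed decides the outcome.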