Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing

Petrucci, Enrico; Noé, Laurent; Pizzi, Cinzia; Comin, Matteo

doi:10.1089/cmb.2019.0298

Cited by 10 publications

(7 citation statements)

References 22 publications


“…CityHash is a general-purpose hash function, which we adapted to spaced seed hashing by replacing the ‘do not care’ positions in each substring with an asterisk. Iterative Spaced Seed Hashing (ISSH) (Petrucci et al, 2020) is a spaced seed hash function that reuses previous hashes based on the seed’s overlapping patterns. Because CityHash and ISSH lack canonical hashing, we also fed the reverse-complement of the input sequences to compare run times with ntHash2.…”

Section: Results (mentioning)

confidence: 99%

ntHash2: recursive spaced seed hashing for nucleotide sequences

et al. 2022


Motivation: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research. Results: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism. Availability: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.
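The "naïve adaptation" described in the citation statement above (masking the "do not care" positions with an asterisk before applying a general-purpose hash) can be sketched roughly as follows. This is only an illustration under stated assumptions: the seed is written as a string of `1` (care) and `0` (do not care), `hashlib.blake2b` stands in for CityHash, which is not in the Python standard library, and the function names are hypothetical rather than taken from ntHash2 or ISSH.

```python
import hashlib

# Assumption: seeds are written as '1' (care) / '0' (do not care) strings.
def mask_window(window: str, seed: str) -> str:
    """Replace every 'do not care' position of the window with an asterisk."""
    return "".join(base if care == "1" else "*" for base, care in zip(window, seed))

def naive_spaced_hashes(seq: str, seed: str) -> list[int]:
    """Hash each masked window from scratch; nothing is reused between windows."""
    k = len(seed)
    hashes = []
    for i in range(len(seq) - k + 1):
        masked = mask_window(seq[i:i + k], seed)
        # Assumption: blake2b is a stand-in for the general-purpose hash (CityHash in the quote).
        digest = hashlib.blake2b(masked.encode("ascii"), digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "little"))
    return hashes

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string, for hash functions without canonical hashing."""
    return seq.translate(COMPLEMENT)[::-1]
```

Because every window is masked and hashed independently, the cost per position grows with the seed length. Hashing `reverse_complement(seq)` as a second input mirrors the workaround the quote describes for hash functions that lack canonical hashing.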



“…When filtering spaced k-mers or searching their occurrences, we currently convert each k-mer to a spaced k-mer separately which takes O(q) time where q is the weight of the spaced seed. We could speed this up by using the technique of Petrucci et al [36] which uses the previous overlapping spaced k-mers to construct the next spaced k-mer. However, currently the consensus step of LOMEX is the most time-consuming step so overall this improvement would only have a minor effect.…”

Section: Discussion (mentioning)

confidence: 99%

Extraction of Long k-mers Using Spaced Seeds

Leinonen, Salmela 2022

IEEE/ACM Trans. Comput. Biol. and Bioinf.


The extraction of k-mers from reads is an important task in many bioinformatics applications, such as all DNA sequence analysis methods based on de Bruijn graphs. These methods tend to be more accurate when the used k-mers are unique in the analyzed DNA, and thus the use of longer k-mers is preferred. When the read lengths of short read sequencing technologies increase, the error rate will become the determining factor for the largest possible value of k. Here we propose LOMEX which uses spaced seeds to extract long k-mers accurately even in the presence of sequencing errors. Our experiments show that LOMEX can extract long k-mers from current Illumina reads with a similar or higher recall than a standard k-mer counting tool. Furthermore, our experiments on simulated data show that when the read length further increases enabling even longer k-mers, the performance of standard k-mer counters declines, whereas LOMEX still extracts long k-mers successfully.
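The citation statement for this entry contrasts converting each k-mer to a spaced k-mer separately, in O(q) time, with reusing the previous overlapping spaced k-mer. A minimal sketch of that contrast is given below. It is a simplified one-step reuse written for illustration, not the algorithm of Petrucci et al. and not LOMEX code; the `1`/`0` seed notation and all function names are assumptions.

```python
def care_positions(seed: str) -> list[int]:
    """Offsets of the care ('1') positions of the seed; there are q of them."""
    return [i for i, c in enumerate(seed) if c == "1"]

def extract_direct(seq: str, seed: str) -> list[str]:
    """Build every spaced k-mer separately by reading all q care positions."""
    care = care_positions(seed)
    return ["".join(seq[i + c] for c in care)
            for i in range(len(seq) - len(seed) + 1)]

def extract_with_reuse(seq: str, seed: str) -> list[str]:
    """Copy symbols from the previous spaced k-mer wherever the seed overlaps
    itself after a shift by one; read the sequence only for the other slots."""
    care = care_positions(seed)
    slot_of = {c: j for j, c in enumerate(care)}
    # source[j]: slot of the previous spaced k-mer holding the base needed
    # in slot j now, or None if that base was masked one step earlier.
    source = [slot_of.get(c + 1) for c in care]
    result, prev = [], None
    for i in range(len(seq) - len(seed) + 1):
        if prev is None:
            cur = [seq[i + c] for c in care]
        else:
            cur = [prev[s] if s is not None else seq[i + c]
                   for c, s in zip(care, source)]
        result.append("".join(cur))
        prev = cur
    return result
```

For the seed `1101` and the sequence `ACGTACGT`, both functions return the same spaced k-mers; in the reuse variant only the slots whose source is `None` touch the sequence after the first window.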


“…(Lin et al, 2008) and SHRiMP2 (David et al, 2011) use spaced seeds to improve the sensitivity when mapping short reads (i.e., Illumina paired-end reads). Although spaced seeds enable finding fuzzy seed matches to improve the sensitivity, generating the hash values of spaced seeds is computationally costly (Petrucci et al, 2020), and spaced seeds cannot find any arbitrary fuzzy seed pairs as these seeds mask a fixed number of characters at certain positions. There have been recent improvements in determining the positions and the number of masks on a seed to improve the sensitivity of spaced seeds (Petrucci et al, 2020;Mallik and Ilie, 2021).…”

Section: Introduction (mentioning)

confidence: 99%

“…Although spaced seeds enable finding fuzzy seed matches to improve the sensitivity, generating the hash values of spaced seeds is computationally costly (Petrucci et al, 2020), and spaced seeds cannot find any arbitrary fuzzy seed pairs as these seeds mask a fixed number of characters at certain positions. There have been recent improvements in determining the positions and the number of masks on a seed to improve the sensitivity of spaced seeds (Petrucci et al, 2020;Mallik and Ilie, 2021). However, none of these works can enable finding arbitrary fuzzy seed matches as they only find patterns for masking positions that still require exact matches at certain positions, which is a key limitation in improving the sensitivity with spaced seeds.…”

Section: Introduction (mentioning)

confidence: 99%

BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis

Fırtına, Park, Alser et al. 2021

Preprint


Motivation: Identifying sequence similarity is a fundamental step in genomic analyses, which is typically performed by first matching short subsequences of each genomic sequence, called seeds, and then verifying the similarity between sequences with sufficient number of matching seeds. The length and number of seed matches between sequences directly impact the accuracy and performance of identifying sequence similarity. Existing attempts optimizing seed matches suffer from performing either 1) the costly similarity verification for too many sequence pairs due to finding a large number of exact-matching seeds or 2) costly calculations to find fewer fuzzy (i.e., approximate) seed matches. Our goal is to efficiently find fuzzy seed matches to improve the performance, memory efficiency, and accuracy of identifying sequence similarity. To this end, we introduce BLEND, a fast, memory-efficient, and accurate mechanism to find fuzzy seed matches. BLEND 1) generates hash values for seeds so that similar seeds may have the same hash value, and 2) uses these hash values to efficiently find fuzzy seed matches between sequences. Results: We show the benefits of BLEND when used in two important genomics applications: finding overlapping reads and read mapping. For finding overlapping reads, BLEND enables a 0.9×-22.4× (on average 8.6×) faster and 1.8×-6.9× (on average 5.43×) more memory-efficient implementation than the state-of-the-art tool, Minimap2. We observe that BLEND finds better quality overlaps that lead to more accurate de novo assemblies compared to Minimap2. When mapping high coverage and accurate long reads, BLEND on average provides 1.2× speedup compared to Minimap2.

