GenStore: a high-performance in-storage processing system for genome sequence analysis

Ghiasi, Nika Mansouri; Park, Jisung; Mustafa, Harun; Kim, Jeremie; Olgun, Ataberk; Gollwitzer, Arvid; Cali, Damla Senol; Fırtına, Can; Mao, Haiyu; Alserr, Nour Almadhoun; Ausavarungnirun, Rachata; Vijaykumar, Nandita; Alser, Mohammed; Mutlu, Onur

doi:10.1145/3503222.3507702

Cited by 34 publications

(13 citation statements)

References 300 publications

“…We already provide the SIMD implementation to calculate the hash values BLEND. We encourage implementing our mechanism for the applications that use seeds to find sequence similarity using processing-in-memory and near-data processing [102][103][104][105][106][107][108][109][110][111][112][113][114], GPUs [115][116][117], and FPGAs and ASICs [118][119][120][121][122][123] to exploit the massive amount of embarrassingly parallel bitwise operations in BLEND to find fuzzy seed matches. Third, we believe it is possible to apply the hashing technique we use in BLEND for many seeding techniques with a proper design.…”

Section: Discussion (mentioning)

confidence: 99%

BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis

Fırtına, Park, Alser et al. 2022

Preprint

Self Cite


Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds, as conventional hashing methods assign distinct hash values to different seeds, including highly similar seeds. Finding only exact-matching seeds causes either 1) increased use of the costly sequence alignment or 2) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds, called fuzzy seed matches, with a single lookup of their hash values. BLEND 1) utilizes a technique called SimHash, which can generate the same hash value for similar sets, and 2) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4x-83.9x (on average 19.3x), has a lower memory footprint by 0.9x-14.1x (on average 3.8x), and finds higher quality overlaps leading to accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8x-4.1x (on average 1.7x) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
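The SimHash idea named in this abstract can be illustrated with a minimal Python sketch. It is not BLEND's implementation: the k-mer decomposition (`kmers`), the BLAKE2b-based `item_hash`, the 32-bit width, and all function names are assumptions chosen only for illustration. The sketch shows the general technique: each item of a set votes on every bit of the final hash, so sets that share most items tend to agree on most bits.

```python
import hashlib


def kmers(seed: str, k: int) -> list[str]:
    """Split a seed into its overlapping k-mers (hypothetical way to treat a seed as a set)."""
    return [seed[i:i + k] for i in range(len(seed) - k + 1)]


def item_hash(item: str, bits: int = 32) -> int:
    """Hash one item to `bits` bits. BLAKE2b is an arbitrary choice for this sketch."""
    digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(items: list[str], bits: int = 32) -> int:
    """Generic SimHash: per-bit majority vote over the hashes of all items."""
    counters = [0] * bits
    for item in items:
        h = item_hash(item, bits)
        for b in range(bits):
            # A set bit votes +1 for that position, a clear bit votes -1.
            counters[b] += 1 if (h >> b) & 1 else -1
    value = 0
    for b in range(bits):
        if counters[b] > 0:
            value |= 1 << b
    return value


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    s1 = simhash(kmers("ACGTACGTACGTAAGG", 5))
    s2 = simhash(kmers("ACGTACGTACGTAAGC", 5))  # differs in the last base only
    print(f"{s1:08x} {s2:08x} differing bits: {hamming(s1, s2)}")
```

Because the result is a majority vote, it does not depend on the order of the items, and a single-item set hashes to that item's own hash value. Whether two similar seeds collapse to exactly the same value depends on the hash width and on how seeds are turned into sets, which this sketch does not attempt to reproduce.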

“…Sequence-to-Sequence Accelerators. Even though there are several hardware accelerators designed to alleviate bottlenecks in several steps of traditional sequence-to-sequence (S2S) mapping (e.g., pre-alignment filtering [72, 73, 75, 76, 94, 140-148], sequence-to-sequence alignment [68-70, 129-132, 149-151]), none of these designs can be directly employed for the sequence-to-graph (S2G) mapping problem. This is because S2S mapping is a special case of S2G mapping, where all nodes have only one edge (Figure 3a).…”

Section: Accelerating Sequence-to-graph Mapping (mentioning)

confidence: 99%

“…Existing hardware accelerators for genome sequence analysis focus on accelerating only the traditional sequence-to-sequence mapping pipeline, and cannot support genome graphs as their inputs. For example, GenStore [142], ERT [144], GenCache [143], NEST [145], MEDAL [146], SaVI [147], SMEM++ [148], Shifted Hamming Distance [94], GateKeeper [72], MAGNET [140], Shouji [141], and SneakySnake [73,76] accelerate the seeding and/or filtering steps of sequence-to-sequence mapping.…”

Section: Related Work (mentioning)

confidence: 99%

SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping

Cali, Kanellopoulos, Lindegger et al. 2022

Preprint

Self Cite


A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. Since sequence-to-sequence mapping can be treated as a special case of sequence-to-graph mapping, we aim to design an accelerator that is efficient for both linear and graph-based read mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator, which finds the candidate locations in a given genome graph; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator, which performs alignment between a given read and the subgraph identified by MinSeed. We couple SeGraM with high-bandwidth memory to exploit low latency and highly-parallel memory access, which alleviates the memory bottleneck.


“…Performing sequence alignment is still computationally expensive and it is an open research problem [106-110, 113]. Due to the low sequencing error rates of Illumina sequencing machines, it is observed that a large fraction of short reads typically maps exactly or with a few mismatches to the reference genome [114-117]. For example, on average 80% of human short reads map exactly to the human reference genome [114].…”

Section: Handling Exactly-matching Short Reads (mentioning)

confidence: 99%

“…For example, on average 80% of human short reads map exactly to the human reference genome [114]. We employ a quick filter [116] that detects exactly-matching reads using SIMD instructions and outputs their alignment information directly to the SAM file without performing sequence alignment calculations for such reads.…”

Section: Handling Exactly-matching Short Reads (mentioning)

confidence: 99%
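The exact-match shortcut described in the statement above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the candidate position is assumed to come from an earlier seeding step, the comparison is a plain string comparison where the cited work uses SIMD instructions, and the function name `exact_match_record` and its simplified (name, position, CIGAR) result are hypothetical.

```python
from typing import Optional, Tuple


def exact_match_record(
    read_name: str, read: str, reference: str, pos: int
) -> Optional[Tuple[str, int, str]]:
    """Return (name, 1-based position, CIGAR) if `read` matches `reference`
    exactly at 0-based offset `pos`; return None if alignment is still needed."""
    if not read or pos < 0 or pos + len(read) > len(reference):
        return None
    # Scalar stand-in for the SIMD comparison mentioned in the quoted text.
    if reference[pos:pos + len(read)] != read:
        return None
    # SAM positions are 1-based; an exact match of length n has CIGAR "nM".
    return (read_name, pos + 1, f"{len(read)}M")


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    print(exact_match_record("read1", "CGTT", reference, 5))
    print(exact_match_record("read2", "CGTA", reference, 5))
```

A read that passes this check can be reported directly, and only reads that return `None` need to go on to the alignment step.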

Taming Large-Scale Genomic Analyses via Sparsified Genomics

Alser, Julien, Mutlu 2022

Preprint


Searching for similar genomic sequences is an essential and fundamental step in biomedical research and an overwhelming majority of genomic analyses. State-of-the-art computational methods performing such comparisons fail to cope with the exponential growth of genomic sequencing data. We introduce the concept of sparsified genomics where we systematically exclude a large number of bases from genomic sequences and enable much faster and more memory-efficient processing of the sparsified, shorter genomic sequences, while providing similar or even higher accuracy compared to processing non-sparsified sequences. Sparsified genomics provides significant benefits to many genomic analyses and has broad applicability. We show that sparsifying genomic sequences greatly accelerates the state-of-the-art read mapper (minimap2) by 1.54-8.8x using real Illumina, HiFi, and ONT reads, while providing a higher number of mapped reads and more detected small and structural variations. Sparsifying genomic sequences makes containment search through very large genomes and very large databases 72.7-75.88x faster and 723.3x more storage-efficient than searching through non-sparsified genomic sequences (with CMash and KMC3). Sparsifying genomic sequences enables robust microbiome discovery by providing 54.15-61.88x faster and 720x more storage-efficient taxonomic profiling of metagenomic samples over the state-of-the-art tool (Metalign). We design and open-source a framework called Genome-on-Diet as an example tool for sparsified genomics, which can be freely downloaded from https://github.com/CMU-SAFARI/Genome-on-Diet.
