Performance extraction and suitability analysis of multi- and many-core architectures for next generation sequencing secondary analysis

Misra, Sanchit; Pan, Tony; Mahadik, Kanak; Powley, George; Vaidya, Priya N.; Vasimuddin; Aluru, Srinivas

doi:10.1145/3243176.3243197

Cited by 11 publications

(20 citation statements)

References 65 publications


“…We demonstrate the efficacy of LISA by comparing the throughput (million-reads/sec) with FM-Index based exact search and SMEM search. For the baseline comparison, we use Trans-Omics Acceleration Library (TAL) which provides the architecture optimized implementations for traditional FM-index exact search and SMEM search [18,21,22]. The optimized SMEM kernel from TAL is also used in BWA-MEM2 [21], an [1] architecture-optimized implementation of BWA-MEM [4].…”

Section: Results (mentioning)

confidence: 99%

“…The key idea behind an FM-index is that, in the lexicographically sorted order of all suffixes of the reference sequence, all matches of a short DNA sequence (a.k.a., a "query") will fall in a single region matching the prefixes of contiguously located suffixes. Over the years, many improvements have been made to make the FM-index more efficient, leading to several state-of-the-art implementations that are highly cache- and processor-optimized [5, 7-10, 12-19]. Hence, it becomes increasingly more challenging to further improve this critical step in the genomics pipeline to scale with increasing data growth.…”

Section: Introduction (mentioning)

confidence: 99%
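The key idea quoted above, that all matches of a query occupy one contiguous region of the sorted suffixes, can be illustrated with a plain suffix array. This is only a minimal sketch: it is not an FM-index (which obtains the same interval through backward search over the Burrows-Wheeler transform), and the function names and the toy reference string are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction, fine for a toy example but not for a genome.
    return sorted(range(len(text)), key=lambda i: text[i:])


def match_interval(text, sa, query):
    """Return the half-open interval [lo, hi) of sa whose suffixes start with query."""
    # Truncating sorted suffixes to len(query) characters keeps them sorted,
    # so two binary searches delimit the matching region.
    prefixes = [text[i:i + len(query)] for i in sa]
    lo = bisect_left(prefixes, query)
    hi = bisect_right(prefixes, query)
    return lo, hi


reference = "ACGTACGTTACG"  # hypothetical toy reference
sa = suffix_array(reference)
lo, hi = match_interval(reference, sa, "ACG")
positions = sorted(sa[lo:hi])  # start positions of every occurrence of "ACG"
```

Every entry of `sa[lo:hi]` is an occurrence of the query, and an empty interval (`lo == hi`) means the query does not occur in the reference.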


LISA: A Case For Learned Index based Acceleration of Biological Sequence Analysis

Kalikar, Misra, et al. 2020

Preprint

Self Cite


Background: Next-generation sequencing (NGS) technologies have enabled affordable sequencing of billions of short DNA fragments at high throughput, paving the way for population-scale genomics. Genomics data analytics at this scale requires overcoming performance bottlenecks, such as searching for short DNA sequences over long reference sequences.

Results: In this paper, we introduce LISA (Learned Indexes for Sequence Analysis), a novel learning-based approach to DNA sequence search. We focus on accelerating two of the most essential flavors of DNA sequence search: exact search and super-maximal exact match (SMEM) search. LISA builds on and extends FM-index, which is the state-of-the-art technique widely deployed in genomics tools. Experiments with human, animal, and plant genome datasets indicate that LISA achieves up to 2.2× and 13.3× speedups over the state-of-the-art FM-index based implementations for exact search and super-maximal exact match (SMEM) search, respectively.
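The abstract does not describe how LISA's learned index works, so the following is only a generic illustration of the learned-index idea, not LISA's method: a model predicts where a key sits in a sorted array, and a search bounded by the model's worst-case error corrects the prediction. The class name, the single linear model, and the example keys are all assumptions; distinct, sorted integer keys are assumed as well.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Toy learned index over distinct, sorted integer keys (illustrative only)."""

    def __init__(self, keys):
        self.keys = keys
        first, last = keys[0], keys[-1]
        # "Model": a straight line through the first and last key.
        self.base = first
        self.slope = (len(keys) - 1) / (last - first) if last > first else 0.0
        # Worst-case distance between predicted and true position over all keys.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(keys))

    def _predict(self, key):
        pos = int(round((key - self.base) * self.slope))
        return min(max(pos, 0), len(self.keys) - 1)  # clamp into the array

    def lookup(self, key):
        """Return the index of key, or -1 if it is absent."""
        p = self._predict(key)
        lo = max(p - self.err, 0)
        hi = min(p + self.err + 1, len(self.keys))
        # Binary search only inside the error window around the prediction.
        i = bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else -1


index = LinearLearnedIndex([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
found = index.lookup(13)
missing = index.lookup(4)
```

The better the model fits the key distribution, the smaller `err` becomes and the fewer elements the final search has to touch.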


“…significantly faster than its alternatives [15]. We also compare with doing backwards search using IP-BWT and binary search (i.e., without the RMI).…”

Section: Discussion (mentioning)

confidence: 99%

“…When evaluated on an Intel® Core™ i9-9900K 3.6 GHz processor, despite being single-threaded and not yet fully optimized to the underlying hardware architecture, our current implementation achieves nearly 4× speedup against a state-of-the-art single-threaded, CPU-optimized version of the FM-index based algorithm [15], for a workload of 50 million queries matched against the human genome. This early result shows that learned DNA sequence search is a promising idea.…”

Section: Introduction (mentioning)

confidence: 99%

LISA: Towards Learned DNA Sequence Search

Ho, Ding, Misra, et al. 2019

Preprint

Self Cite


Next-generation sequencing (NGS) technologies have enabled affordable sequencing of billions of short DNA fragments at high throughput, paving the way for population-scale genomics. Genomics data analytics at this scale requires overcoming performance bottlenecks, such as searching for short DNA sequences over long reference sequences. In this paper, we introduce LISA (Learned Indexes for Sequence Analysis), a novel learning-based approach to DNA sequence search. As a first proof of concept, we focus on accelerating one of the most essential flavors of the problem, called exact search. LISA builds on and extends FM-index, which is the state-of-the-art technique widely deployed in genomics toolchains. Initial experiments with human genome datasets indicate that LISA achieves up to a factor of 4× performance speedup against its traditional counterpart.

The state-of-the-art technique to perform exact search is based on building an FM-index over the reference genome [8]. The key idea behind an FM-index is that, in the lexicographically sorted order of all suffixes of the reference sequence, all matches of a short DNA sequence (a.k.a., a "query") will fall in a single region matching the prefixes of contiguously located suffixes. Over the years, many improvements have been made to make the FM-index more efficient, leading to several state-of-the-art…

Workshop on Systems for ML at NeurIPS 2019


“…Computing score matrices is highly compute-intensive, and consumes most of the time in sequential as well as our parallel algorithm. The proposed algorithm is inspired from previous optimization efforts targeted towards accelerating Smith-Waterman alignment using SIMD instructions [31], [32]. Alignment of a single sequence is called a task.…”

Section: Parallel Computation Of the Score Matrix (mentioning)

confidence: 99%

Accelerating Sequence Alignment to Graphs

Jain, Dilthey, Misra, et al. 2019

Preprint

Self Cite


Aligning DNA sequences to an annotated reference is a key step for genotyping in biology. Recent scientific studies have demonstrated improved inference by aligning reads to a variation graph, i.e., a reference sequence augmented with known genetic variations. Given a variation graph in the form of a directed acyclic string graph, the sequence to graph alignment problem seeks to find the best matching path in the graph for an input query sequence. Solving this problem exactly using a sequential dynamic programming algorithm takes quadratic time in terms of the graph size and query length, making it difficult to scale to high throughput DNA sequencing data. In this work, we propose the first parallel algorithm for computing sequence to graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations. We take advantage of the available inter-task parallelism, and provide a novel blocked approach to compute the score matrix while ensuring high memory locality. Using a 48-core Intel Xeon Skylake processor, the proposed algorithm achieves peak performance of 317 billion cell updates per second (GCUPS), and demonstrates near linear weak and strong scaling on up to 48 cores. It delivers significant performance gains compared to existing algorithms, and results in run-time reduction from multiple days to three hours for the problem of optimally aligning high coverage long (PacBio/ONT) or short (Illumina) DNA reads to an MHC human variation graph containing 10 million vertices.

Availability: The implementation of our algorithm is available at https://github.com/ParBLiSS/PaSGAL. Data sets used for evaluation are accessible using https://alurulab.cc.gatech.edu/PaSGAL.
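The sequential dynamic program that the abstract refers to can be sketched as below. This is a hedged illustration and not the PaSGAL implementation: it assumes one character per vertex, vertices already numbered in topological order, a linear gap penalty, and local (Smith-Waterman style) scoring. The function name, the scoring parameters, and the toy graph are hypothetical.

```python
def align_to_dag(labels, preds, query, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of query against any path of a DAG.

    labels[v] is the character on vertex v and preds[v] lists its
    predecessors. Vertices are assumed to be in topological order, so
    every predecessor index is smaller than v.
    """
    n, m = len(labels), len(query)
    # score[v][j]: best score of an alignment ending at vertex v and query[j - 1].
    score = [[0] * (m + 1) for _ in range(n)]
    best = 0
    for v in range(n):
        for j in range(1, m + 1):
            sub = match if labels[v] == query[j - 1] else mismatch
            # Restart (0), start a new alignment at v (sub),
            # or leave query[j - 1] unaligned (gap).
            cell = max(0, sub, score[v][j - 1] + gap)
            for u in preds[v]:
                # Extend along edge u -> v, or skip vertex v's character.
                cell = max(cell, score[u][j - 1] + sub, score[u][j] + gap)
            score[v][j] = cell
            best = max(best, cell)
    return best


# Toy variation graph: A -> C -> (G or T) -> A
labels = "ACGTA"
preds = [[], [0], [1], [1], [2, 3]]
score_t_branch = align_to_dag(labels, preds, "ACTA")
score_g_branch = align_to_dag(labels, preds, "ACGA")
```

Each cell inspects every incoming edge of its vertex, so the work is proportional to (vertices + edges) × query length, which is the quadratic cost that motivates the parallel algorithm.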
