2020
DOI: 10.1098/rsta.2019.0394

The parallelism motifs of genomic data analysis

Abstract: Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different r…

Cited by 13 publications (10 citation statements)
References 72 publications
“…Our GPU optimizations effectively turn a compute-bound problem into one dominated by communication. In particular, many-to-many k-mer exchange for redistributing k-mers tends to be the secondary bottleneck at small scales and the primary bottleneck at large scale of distributed memory kmer counters [7], [10], [22], [33]. Our novel use of supermers in distributed memory parallelization is combined with GPU optimizations, improving communication costs by reducing communication volume.…”
Section: Discussion (mentioning)
confidence: 99%
“…In particular, the distribution of k-mers is not fixed across biological input datasets and cannot be determined until the run time. The primary methods for scalable-distributed memory k-mer counting rely on distributed hash tables [6], [7], [10], [12], [21], [33].…”
Section: Introduction (mentioning)
confidence: 99%
“…Our algorithm will boost many applications in genomics, scientific computing, and social network analysis where SpGEMM has emerged as a key computational kernel. For example, Yelick et al [39] regarded SpGEMM as a parallelism motif of genomic data analysis with applications in alignment, profiling, clustering and assembly for both single genomes and metagenomes. With the size of genomic data growing exponentially, extreme-scale SpGEMM presented in this paper will enable rapid scientific discoveries in these applications.…”
Section: Discussion (mentioning)
confidence: 99%
“…Memory needs can approach the terabyte scale, and therefore, the computations commonly use compute clusters with >32 central processing units (CPUs) and >4 GB of random access memory per CPU. GPU computing and edge computing are popularly used due to the inherent single instruction multiple data (SIMD) nature of the computation. Download and storage of the data can also require petabyte-scale systems such as those hosted by the National Microbiome Data Collaborative…”
Section: Machine Learning Methods Applied To Habs (mentioning)
confidence: 99%