2019
DOI: 10.1016/j.parco.2019.102545

Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors

Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks with memory management and th…

Citation Types: 0 supporting, 63 mentioning, 0 contrasting

Cited by 36 publications (63 citation statements)
References 30 publications (59 reference statements)
“…HipMCL is an iterative algorithm that relies on SpGEMM as its workhorse at each iteration. The ExaGraph project ported the hash-based SpGEMM algorithm, which was originally developed for GPUs by collaborators (Nagasaka et al., 2017), into multicore CPUs and Intel KNLs (Nagasaka et al., 2019). For GPU-equipped clusters, we developed a model to choose the fastest GPU-based SpGEMM depending on the sparsity of the current MCL iteration and utilized a pipelined communication scheme that hides the cost of CPU-to-GPU data transfers.…”
Section: Algebraic Approaches For Graph Algorithms and Combinatorial Problems
mentioning (confidence: 99%)
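
The hash-based accumulation referenced in this excerpt merges duplicate column indices from different partial products on the fly, one hash table per output row. A minimal single-threaded sketch on CSR inputs is given below; the struct and function names are illustrative, and the implementation of Nagasaka et al. (2019) uses custom open-addressing tables with OpenMP threading rather than std::unordered_map.

```cpp
// Minimal sketch of row-wise hash SpGEMM (C = A * B) on CSR matrices.
// Illustrative only: production implementations use preallocated,
// open-addressing hash tables and thread-level parallelism.
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Csr {
    std::vector<std::size_t> rowptr;  // size nrows + 1
    std::vector<std::size_t> colidx;  // size nnz
    std::vector<double>      values;  // size nnz
};

Csr spgemm_row_hash(const Csr& A, const Csr& B) {
    Csr C;
    C.rowptr.push_back(0);
    const std::size_t nrows = A.rowptr.size() - 1;
    for (std::size_t i = 0; i < nrows; ++i) {
        // Accumulate row i of C: every nonzero a_ik scales row k of B.
        std::unordered_map<std::size_t, double> acc;
        for (std::size_t p = A.rowptr[i]; p < A.rowptr[i + 1]; ++p) {
            const std::size_t k = A.colidx[p];
            const double a_ik = A.values[p];
            for (std::size_t q = B.rowptr[k]; q < B.rowptr[k + 1]; ++q)
                acc[B.colidx[q]] += a_ik * B.values[q];  // merge duplicates
        }
        // Flush the accumulator into row i of C (column order is arbitrary).
        for (const auto& [j, v] : acc) {
            C.colidx.push_back(j);
            C.values.push_back(v);
        }
        C.rowptr.push_back(C.colidx.size());
    }
    return C;
}
```

Because each row of C is built independently, parallelizing over rows is straightforward, which is part of what makes the port from GPUs to multicore CPUs and KNLs natural.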
“…The high-performance distributed re-implementation of the Markov Cluster algorithm, known as HipMCL [40], uses some of the most general and scalable sparse matrix algorithms implemented within the Combinatorial BLAS [41]. These algorithms include a two-dimensional SpGEMM algorithm known as Sparse SUMMA [24], several different shared memory SpGEMM algorithms [42] that are optimized for different iterations of HipMCL, a fast memory estimator based on sparse matrix dense matrix multiplication for memory-efficient SpGEMM [43], as well as a very fast distributed memory connected components algorithm [44] that is used for extracting the final clusters from the result of the HipMCL iterations. The integration of GPU support as well as faster communication-avoiding SpGEMM algorithms [45] is ongoing work.…”
Section: (E) Sparse Matrix Operations For Protein Clustering
mentioning (confidence: 99%)
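
The memory estimator mentioned in this excerpt predicts the size of the output before any numeric work, so that C can be allocated in a single pass. The probabilistic SpMM-based estimator cited as [43] is more involved; the sketch below shows only the classic deterministic upper bound such estimators refine, counting one partial product per (a_ik, b_kj) pair. All names are illustrative.

```cpp
// Hedged sketch: deterministic upper bound on nnz(C) for C = A * B.
// This is the standard flop-count bound, not the probabilistic
// estimator cited as [43]; it illustrates why a symbolic estimate
// lets the numeric phase allocate the output once.
#include <cstddef>
#include <vector>

std::size_t upper_bound_nnz(const std::vector<std::size_t>& A_rowptr,
                            const std::vector<std::size_t>& A_colidx,
                            const std::vector<std::size_t>& B_rowptr) {
    std::size_t bound = 0;
    const std::size_t nrows = A_rowptr.size() - 1;
    for (std::size_t i = 0; i < nrows; ++i)
        for (std::size_t p = A_rowptr[i]; p < A_rowptr[i + 1]; ++p) {
            const std::size_t k = A_colidx[p];
            bound += B_rowptr[k + 1] - B_rowptr[k];  // nnz of B(k,:)
        }
    return bound;  // nnz(C) never exceeds the number of partial products
}
```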
“…These methods do not simply partition the result matrix or particles/sequences over processors, but instead replicate them to the extent allowed by available memory. For sparse matrices and sparse interactions, the benefits depend more on the sparsity patterns [24,42,43,70], but are useful in clustering [40] and possibly alignment.…”
Section: Hardware and Software Support For Parallel Genome Analysis
mentioning (confidence: 99%)
“…The implementation of this method within our pipeline enables the use of high-performance techniques previously not applied in the context of long-read alignment. It also allows continuing performance improvements in this step due to the ever-improving optimized implementations of SpGEMM (Nagasaka et al., 2019; Deveci et al., 2017).…”
Section: Proposed Algorithm
mentioning (confidence: 99%)
“…More importantly, the computational problem of accumulating the contributions from multiple shared k-mers to each pair of reads is handled automatically by the choice of appropriate data structures within SpGEMM. Figure 2 illustrates the merging operation of BELLA, which uses a hash table data structure indexed by the row indexes of A, following the multi-threaded implementation proposed by Nagasaka et al. (2019). Finally, the contents of the hash table are stored into a column of the final matrix once all required nonzeros for that column are accumulated.…”
Section: Sparse Matrix Construction and Multiplication
mentioning (confidence: 99%)
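
The merging operation described here is the column-wise counterpart of the row-wise hash accumulation: for each column j of B, every nonzero b_kj pulls in column k of A, and partial products are merged in a hash table keyed by the row index of A before being flushed as column j of C. A hedged sketch on CSC inputs follows; BELLA accumulates k-mer position pairs rather than scalars, so the double values and std::unordered_map below are simplifications to keep the sketch self-contained.

```cpp
// Hedged sketch of column-wise hash merging for C = A * B on CSC matrices,
// keyed by row index of A as described in the excerpt. Scalar doubles
// stand in for BELLA's k-mer position pairs.
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Csc {
    std::vector<std::size_t> colptr;  // size ncols + 1
    std::vector<std::size_t> rowidx;  // size nnz
    std::vector<double>      values;  // size nnz
};

Csc spgemm_column_hash(const Csc& A, const Csc& B) {
    Csc C;
    C.colptr.push_back(0);
    const std::size_t ncols = B.colptr.size() - 1;
    for (std::size_t j = 0; j < ncols; ++j) {
        std::unordered_map<std::size_t, double> acc;  // keyed by row of A
        for (std::size_t p = B.colptr[j]; p < B.colptr[j + 1]; ++p) {
            const std::size_t k = B.rowidx[p];  // b_kj selects column k of A
            const double b_kj = B.values[p];
            for (std::size_t q = A.colptr[k]; q < A.colptr[k + 1]; ++q)
                acc[A.rowidx[q]] += A.values[q] * b_kj;
        }
        // Flush column j of C only after all its nonzeros are accumulated.
        for (const auto& [i, v] : acc) {
            C.rowidx.push_back(i);
            C.values.push_back(v);
        }
        C.colptr.push_back(C.rowidx.size());
    }
    return C;
}
```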