Multicore and Accelerator Development for a Leadership-Class Stellar Astrophysics Code
2013 · DOI: 10.1007/978-3-642-36803-5_6

Cited by 27 publications (13 citation statements) · References 20 publications
“…To exploit the parallelism and near-processor memory of powerful high-performance computing systems, it is common for application programs to bind many small operations into one cumulatively large computation. This trend gives rise to batched matrix multiplications, which appear frequently in, e.g., quantum chemistry [13], astrophysics [39], metabolic networks [32], computational fluid dynamics [44], domain decomposition solvers [10], tensor computations [45], and deep learning [6,11]. It has been shown that in these applications performance can be greatly improved by exploiting batched computations of small matrix multiplications [8,19,37].…”
Section: Introduction (citation type: mentioning; confidence: 99%)
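The batching pattern this excerpt describes is easy to sketch in CUDA: rather than launching one kernel per small multiplication, a whole batch is submitted at once and each thread block computes one small product. A minimal sketch, assuming square row-major matrices of a fixed small size N stored contiguously; the kernel name and sizes are illustrative, not taken from the cited papers:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 16  // small matrix dimension (illustrative)

// One thread block computes C[b] = A[b] * B[b] for one N x N matrix in the
// batch; thread (col, row) produces a single output element. Matrices are
// stored contiguously in row-major order, batch after batch.
__global__ void batchedGemmKernel(const double *A, const double *B,
                                  double *C, int batchCount) {
    int b = blockIdx.x;                        // one block per matrix
    if (b >= batchCount) return;
    int row = threadIdx.y, col = threadIdx.x;
    const double *Ab = A + (size_t)b * N * N;
    const double *Bb = B + (size_t)b * N * N;
    double acc = 0.0;
    for (int k = 0; k < N; ++k)
        acc += Ab[row * N + k] * Bb[k * N + col];
    C[(size_t)b * N * N + row * N + col] = acc;
}

int main() {
    const int batchCount = 1000;
    size_t elems = (size_t)batchCount * N * N;
    double *A, *B, *C;
    cudaMallocManaged(&A, elems * sizeof(double));
    cudaMallocManaged(&B, elems * sizeof(double));
    cudaMallocManaged(&C, elems * sizeof(double));
    for (size_t i = 0; i < elems; ++i) { A[i] = 1.0; B[i] = 2.0; }

    dim3 threads(N, N);                        // one thread per output element
    batchedGemmKernel<<<batchCount, threads>>>(A, B, C, batchCount);
    cudaDeviceSynchronize();

    printf("C[0](0,0) = %f (expect %f)\n", C[0], 2.0 * N);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Submitting one block per matrix amortizes the launch overhead across the whole batch; for matrices this small, launching a separate kernel per multiplication would be dominated by launch latency rather than arithmetic.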
“…Moreover, there are good reasons to believe that neither improved compiler technology nor autotuning will make any significant headway on this problem. This lack of coverage by current library infrastructure is especially alarming given the number of applications from important fields that fit this profile, including deep learning [8], data mining [31], astrophysics [23], image and signal processing [4], [24], hydrodynamics [10], quantum chemistry [5], and computational fluid dynamics (CFD), where the resulting partial differential equations (PDEs) are handled through direct and multifrontal solvers [42], to name a few. Dramatically better performance on these applications can be achieved by using software that can repetitively execute small matrix/tensor operations grouped together in "batches."…”
Section: Introduction (citation type: mentioning; confidence: 99%)
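In practice the grouping into "batches" usually goes through a library interface rather than a hand-written kernel. cuBLAS, for instance, exposes cublasDgemmBatched, which takes device arrays of per-matrix pointers. A hedged sketch of the call (error checking omitted; the matrix contents and sizes are illustrative, and the pointer-array setup is the part specific to the batched interface):

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Multiply a batch of small n x n matrices with a single cuBLAS call.
int main() {
    const int n = 32, batchCount = 500;
    size_t elems = (size_t)batchCount * n * n;

    double *A, *B, *C;
    cudaMallocManaged(&A, elems * sizeof(double));
    cudaMallocManaged(&B, elems * sizeof(double));
    cudaMallocManaged(&C, elems * sizeof(double));
    for (size_t i = 0; i < elems; ++i) { A[i] = 1.0; B[i] = 1.0; C[i] = 0.0; }

    // The batched interface takes device-visible arrays of pointers,
    // one pointer per matrix in the batch.
    std::vector<const double*> hA(batchCount), hB(batchCount);
    std::vector<double*> hC(batchCount);
    for (int b = 0; b < batchCount; ++b) {
        hA[b] = A + (size_t)b * n * n;
        hB[b] = B + (size_t)b * n * n;
        hC[b] = C + (size_t)b * n * n;
    }
    const double **dA, **dB;
    double **dC;
    cudaMalloc((void**)&dA, batchCount * sizeof(double*));
    cudaMalloc((void**)&dB, batchCount * sizeof(double*));
    cudaMalloc((void**)&dC, batchCount * sizeof(double*));
    cudaMemcpy(dA, hA.data(), batchCount * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), batchCount * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), batchCount * sizeof(double*), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    // C[b] = alpha * A[b] * B[b] + beta * C[b], for every b, in one call.
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, dA, n, dB, n, &beta, dC, n, batchCount);
    cudaDeviceSynchronize();

    printf("C[0](0,0) = %f (expect %d)\n", C[0], n);  // all-ones inputs
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

MAGMA and vendor BLAS implementations provide analogous batched routines; the common design point is that a pointer array, rather than a loop over individual GEMM calls, is what lets the library schedule all the small products together.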
“…Also, in combustion and astrophysical supernova applications [6], [7], [17], [23], [32], the study of thermonuclear reaction networks (the XNet package) requires the solution of many sparse linear systems of size around 150 × 150. Furthermore, the need for batched routines can be illustrated in radar signal processing [4], where a batch of 200 × 200 QR decompositions is needed, as well as in hydrodynamic simulations [10], where thousands of matrix-matrix (GEMM) and matrix-vector (GEMV) products of matrices of around 100 × 100 are needed.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
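The reaction-network workload quoted above (many independent systems of around 150 × 150) maps naturally onto batched factorization routines. A hedged sketch using cuBLAS's cublasDgetrfBatched and cublasDgetrsBatched, treating each system as dense for simplicity; the quoted systems are sparse, and a production code such as XNet would exploit that structure:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Solve a batch of small dense systems A[b] x[b] = rhs[b] with one batched
// LU factorization (getrf) followed by one batched solve (getrs).
int main() {
    const int n = 150, batchCount = 256;

    double *A, *rhs;
    cudaMallocManaged(&A,   (size_t)batchCount * n * n * sizeof(double));
    cudaMallocManaged(&rhs, (size_t)batchCount * n * sizeof(double));
    // Symmetric, diagonally dominant test matrices (n on the diagonal,
    // 1 elsewhere) with all-ones right-hand sides; by symmetry every
    // solution entry should equal 1 / (2n - 1).
    for (int b = 0; b < batchCount; ++b) {
        for (size_t i = 0; i < (size_t)n * n; ++i)
            A[(size_t)b * n * n + i] = (i % (n + 1) == 0) ? (double)n : 1.0;
        for (int i = 0; i < n; ++i) rhs[(size_t)b * n + i] = 1.0;
    }

    // Per-matrix and per-vector pointer arrays, as the batched API expects.
    std::vector<double*> hA(batchCount), hX(batchCount);
    for (int b = 0; b < batchCount; ++b) {
        hA[b] = A   + (size_t)b * n * n;
        hX[b] = rhs + (size_t)b * n;
    }
    double **dA, **dX;
    cudaMalloc((void**)&dA, batchCount * sizeof(double*));
    cudaMalloc((void**)&dX, batchCount * sizeof(double*));
    cudaMemcpy(dA, hA.data(), batchCount * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dX, hX.data(), batchCount * sizeof(double*), cudaMemcpyHostToDevice);

    int *pivots, *infoLU;                      // device-side pivots and statuses
    cudaMalloc((void**)&pivots, (size_t)batchCount * n * sizeof(int));
    cudaMalloc((void**)&infoLU, batchCount * sizeof(int));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasDgetrfBatched(handle, n, dA, n, pivots, infoLU, batchCount);
    int infoSolve = 0;                         // host-side status for the solve
    cublasDgetrsBatched(handle, CUBLAS_OP_N, n, 1,
                        (const double* const*)dA, n, pivots,
                        dX, n, &infoSolve, batchCount);
    cudaDeviceSynchronize();

    printf("x[0][0] = %f (expect %f)\n", rhs[0], 1.0 / (2 * n - 1));
    cublasDestroy(handle);
    cudaFree(pivots); cudaFree(infoLU); cudaFree(dA); cudaFree(dX);
    cudaFree(A); cudaFree(rhs);
    return 0;
}
```

Factoring all 256 systems in one call and then solving them in a second is the same amortization idea as batched GEMM: the per-system work is far too small to occupy a GPU on its own.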
“…Many numerical libraries and applications already use this functionality and need it developed further. Examples include the tile algorithms from the area of dense linear algebra [2], various register and cache blocking techniques for sparse computations [11], sparse direct multifrontal solvers [30], high-order FEM [7], and numerous applications from, e.g., astrophysics [17], hydrodynamics [7], image processing [18], signal processing [5], etc.…”
Section: Introduction (citation type: mentioning; confidence: 99%)