NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems Implementation of cuThomasBatch

Valero-Lara, Pedro; Martínez-Pérez, Ivan; Sirvent, Raül; Martorell, Xavier; Peña, Antonio J.

doi:10.1007/978-3-319-78024-5_22

Cited by 16 publications

(20 citation statements)

References 8 publications

“…We evaluate the scalability of both approaches, gtsvStridedBatch and cuThomasBatch, for computing multiple and independent tridiagonal systems on NVIDIA GPUs. The present work extends the previously published works [13] with additional contributions. This work includes a new approach for the cuThomasBatch, which makes use of a unified vector in order to exploit the memory hierarchy more efficiently.…”

Section: Introduction (supporting)

confidence: 83%

cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs

Valero-Lara, Martínez-Pérez, Sirvent, et al. 2018

Concurrency and Computation

Self Cite

Solving tridiagonal systems is one of the most computationally expensive parts of many applications, so multiple studies have explored the use of NVIDIA GPUs to accelerate such computation. However, these studies have mainly focused on parallel algorithms, which can efficiently exploit the shared memory and saturate the GPU's capacity with a low number of systems, but scale poorly when dealing with a relatively high number of systems. The gtsvStridedBatch routine in the cuSPARSE NVIDIA package is one such example and is used as the reference in this article. We propose a new implementation (cuThomasBatch) based on the Thomas algorithm. Unlike other algorithms, the Thomas algorithm is sequential, and so a coarse-grained approach is implemented where one CUDA thread solves a complete tridiagonal system instead of one CUDA block as in gtsvStridedBatch. To achieve good scalability using this approach, it is necessary to transform the way the inputs are stored in memory to exploit coalescence (contiguous threads accessing contiguous memory locations). Different variants of this data transformation are explored in detail. We also explore some variants for the variable-batch case, in which the systems in the batch have different sizes (cuThomasVBatch). The results given in this study show that the implementations carried out in this work outperform the reference code, being up to 5× (in double precision) and 6× (in single precision) faster using the latest NVIDIA GPU architecture, the Pascal P100.
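As a rough illustration of the coarse-grained approach described in the abstract above, the following Python sketch solves one tridiagonal system with the Thomas algorithm and then loops over a batch, with each loop iteration standing in for one CUDA thread. The names `thomas_solve` and `solve_batch` are hypothetical, this is not the cuThomasBatch code itself, and the sketch assumes no pivoting is needed (for example, diagonally dominant systems).

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system with the sequential Thomas algorithm.

    lower[0] and upper[-1] lie outside the matrix and do not affect the
    result. No pivoting is performed (assumption: the system is well
    conditioned, e.g. diagonally dominant).
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n

    # Forward elimination: remove the sub-diagonal row by row.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution: recover the unknowns from last to first.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def solve_batch(systems):
    """Solve independent systems; each iteration mimics one CUDA thread."""
    return [thomas_solve(lower, diag, upper, rhs)
            for (lower, diag, upper, rhs) in systems]
```

Because every step of the forward and backward sweeps depends on the previous one, the work inside a single system cannot be split across threads; the parallelism comes only from the number of independent systems in the batch.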

“…The main contribution of this work is a novel and highly scalable implementation able to deal with multi-morphology simulations based on cuThomasBatch implementation [25]. Although in this paper the cuThomasBatch was proven to be a fast implementation for batches of full-tridiagonal systems, this is not enough to compute the sparsity found in Hines matrices.…”

Section: Related Work (mentioning)

confidence: 96%

Simulating the behavior of the Human Brain on GPUs

Valero-Lara, Martínez-Pérez, Sirvent, et al. 2018

Oil Gas Sci. Technol. – Rev. IFP Energies nouvelles

Self Cite

The simulation of the behavior of the Human Brain is one of the most important challenges in computing today. The main problem consists of finding efficient ways to manipulate and compute the huge volume of data that this kind of simulation needs, using the current technology. In this sense, this work is focused on one of the main steps of such simulation, which consists of computing the Voltage on neurons’ morphology. This is carried out using the Hines Algorithm and, although this algorithm is the optimum method in terms of number of operations, it needs non-trivial modifications to be efficiently parallelized on GPUs. We propose several optimizations to accelerate this algorithm on GPU-based architectures, exploring the limitations of both method and architecture, to be able to efficiently solve a high number of Hines systems (neurons). Each optimization is analyzed and described in depth. Two different approaches are studied, one for mono-morphology simulations (batch of neurons with the same shape) and one for multi-morphology simulations (batch of neurons where every neuron has a different shape). In mono-morphology simulations we obtain good performance using just a single kernel to compute all the neurons. However, this turns out to be inefficient on multi-morphology simulations. Unlike the previous scenario, in multi-morphology simulations a much more complex implementation is necessary to obtain good performance. In this case, we must execute more than one GPU kernel. In every execution (kernel call) one specific part of the batch of neurons is solved. These parts can be seen as multiple and independent tridiagonal systems. Although the present paper is focused on the simulation of the behavior of the Human Brain, some of these techniques, in particular those related to the solving of tridiagonal systems, can also be used for multiple oil and gas simulations.
Our studies have shown that the optimizations proposed in the present work can achieve high performance on computations with a high number of neurons, with our GPU implementations being about 4× and 8× faster than the OpenMP multicore implementation (16 cores), using one and two NVIDIA K80 GPUs respectively. Also, it is important to highlight that these optimizations can continue scaling, even when dealing with a very high number of neurons.

“…In order to solve the above scheme in batched form on a GPU we follow the methodology of cuThomasBatch [1] with some modifications. We retain the key aspect of interleaved data layout, this means that the first row of the batch data will contain the first entry in each linear system Ax_i = f_i (the subscript i labels the different systems in the batch), the second row the second entry and so on.…”

Section: F Implementation On GPU (mentioning)

confidence: 99%

“…The starting-point for developing the batched pentadiagonal solver is an existing batched tridiagonal solver called cuThomasBatch [1], based on the Thomas Algorithm, and now part of the CUDA library as gtsvInterleavedBatch. We herein extend cuThomasBatch to accommodate pentadiagonal problems.…”

Section: Introduction (mentioning)

confidence: 99%

cuPentBatch—A batched pentadiagonal solver for NVIDIA GPUs

Gloster, Náraigh 2019

Computer Physics Communications

We introduce cuPentBatch, our own pentadiagonal solver for NVIDIA GPUs. The development of cuPentBatch has been motivated by applications involving numerical solutions of parabolic partial differential equations, which we describe. Our solver is written with batch processing in mind (as necessitated by parameter studies of various physical models). In particular, our solver is directed at those problems where only the right-hand side of the matrix changes as the batch solutions are generated. As such, we demonstrate that cuPentBatch outperforms the NVIDIA standard pentadiagonal batch solver gpsvInterleavedBatch for the class of physically-relevant computational problems encountered herein.
Program Summary
Program Title: cuPentBatch
https://github.com/munstermonster/cuPentBatch
Licensing Provision: Apache License 2.0
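The interleaved data layout mentioned in the cuPentBatch citation statement above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `interleave` and `deinterleave` are hypothetical, all systems are assumed to have the same length, and the index formula `j * batch + i` is one natural reading of "the first row of the batch data will contain the first entry in each linear system".

```python
def interleave(systems):
    """Pack per-system vectors into one flat, interleaved array.

    Entry j of system i is stored at flat index j * batch + i, so "row" j
    of the packed data holds the j-th entry of every system in the batch.
    Assumption: every system has the same length.
    """
    batch = len(systems)
    n = len(systems[0])
    flat = [0.0] * (batch * n)
    for i, vec in enumerate(systems):
        for j, value in enumerate(vec):
            flat[j * batch + i] = value
    return flat


def deinterleave(flat, batch):
    """Inverse of interleave: recover one vector per system."""
    n = len(flat) // batch
    return [[flat[j * batch + i] for j in range(n)] for i in range(batch)]
```

With this layout, when thread i and thread i + 1 both read entry j of their own systems, they touch adjacent memory locations, which is the coalesced access pattern the cuThomasBatch abstract describes.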
