2017
DOI: 10.1002/cpe.4244
Fast synchronization‐free algorithms for parallel sparse triangular solves with multiple right‐hand sides

Abstract: The sparse triangular solve kernels, SpTRSV and SpTRSM, are important building blocks for a number of numerical linear algebra routines. Parallelizing SpTRSV and SpTRSM on today's manycore platforms, such as GPUs, is not an easy task since computing a component of the solution may depend on previously computed components, enforcing a degree of sequential processing. As a consequence, most existing work introduces a preprocessing stage to partition the components into a group of level-sets or colour-sets so that…
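The level-set preprocessing mentioned in the abstract can be illustrated with a minimal sketch (ours, not from the paper): rows of a lower-triangular CSR matrix are grouped so that every row in a level depends only on rows in earlier levels, letting each level be solved in parallel.

```python
# Hypothetical sketch of level-set analysis for a lower-triangular
# system Lx = b stored in CSR (diagonal entries included).
def level_sets(indptr, indices, n):
    """Group the n rows into levels; rows within a level are independent."""
    level = [0] * n
    for row in range(n):
        # A row's level is one more than the deepest level among the
        # earlier rows it depends on (its strictly-lower column indices).
        deps = [indices[k] for k in range(indptr[row], indptr[row + 1])
                if indices[k] < row]
        level[row] = 1 + max((level[j] for j in deps), default=-1)
    nlevels = max(level) + 1
    sets = [[] for _ in range(nlevels)]
    for row in range(n):
        sets[level[row]].append(row)
    return sets
```

For a 4x4 lower-triangular matrix where row 1 depends on row 0 and row 3 depends on row 1, this yields the levels `[[0, 2], [1], [3]]`: rows 0 and 2 can be solved concurrently, then row 1, then row 3. The cost of this analysis, plus the per-level kernel launches it implies, is exactly the overhead the paper's synchronization-free approach avoids.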

Cited by 43 publications (25 citation statements)
References 44 publications
“…Recently, Liu et al presented a self‐scheduled two‐stage GPU method for matrices in the CSC format, based on a lightweight analysis phase that avoids the constant synchronization with the CPU implied by launching a kernel per level‐set in the cuSparse routine. To the best of our knowledge, there are no other significant works that apply this type of algorithm to solve sparse triangular systems on hardware accelerators.…”
Section: Related Work
confidence: 99%
“…In order to keep track of this, we store an integer ready vector with one entry per unknown, set to one once that unknown has been solved and zero otherwise. Unlike the work presented by Liu et al, our algorithm is tailored specifically for the CSR matrix format; it avoids atomic operations and does not require a preprocessing stage.…”
Section: Proposal
confidence: 99%
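The ready-vector idea quoted above can be sketched as follows (a minimal illustration under our own naming, not the cited authors' code). Shown sequentially for clarity; in the parallel version each row is handled by its own thread, which spins on the `ready` flags of its dependencies instead of the assertion below.

```python
# Hypothetical sketch of a ready-vector sparse triangular solve for
# Lx = b with L in CSR format (diagonal included in each row).
def sptrsv_ready(indptr, indices, data, b):
    n = len(b)
    ready = [0] * n          # ready[j] == 1 once x[j] has been computed
    x = [0.0] * n
    for row in range(n):     # in the parallel version, rows run concurrently
        s = b[row]
        diag = 1.0
        for k in range(indptr[row], indptr[row + 1]):
            col = indices[k]
            if col == row:
                diag = data[k]
            else:
                # A GPU thread would busy-wait here until ready[col] == 1.
                assert ready[col] == 1
                s -= data[k] * x[col]
        x[row] = s / diag
        ready[row] = 1       # publish the solved component to other rows
    return x
```

Because each row only reads flags and writes its own entry, no level-set preprocessing is needed; correctness on a GPU would additionally require the write to `x[row]` to be visible before the flag is set (e.g. via a memory fence).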
“…Compared to stochastic gradient descent (SGD) [8,9], the ALS algorithm is not only inherently parallel but can also incorporate implicit ratings [1]. Nevertheless, ALS involves parallel sparse matrix manipulation [10], for which achieving high performance is challenging due to imbalanced workloads [11,12,13], random memory accesses [14,15], unpredictable amounts of computation [16], and task dependencies [17,18,19]. This holds in particular when parallelizing and optimizing ALS on modern multi‐core and many‐core platforms [20].…”
Section: Introduction
confidence: 99%
“…To the best of our knowledge, most existing parallel triangular solvers target shared‐memory machines or GPUs (see [2,16,17,25] and references therein). These solvers often rely on well‐known techniques such as level‐set, color‐set, or block scheduling algorithms.…”
Section: Introduction
confidence: 99%