Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA

Grigoras, Paul; Burovskiy, Pavel; Luk, Wayne; Sherwin, Spencer J.

doi:10.1109/fpl.2016.7577352

Cited by 21 publications

(9 citation statements)

References 23 publications


“…For example, Dziekoński et al (2017) used a conjugate gradient solver and optimized matvec as its important part. Dehnavi et al (2010) and Grigoras et al (2016) presented similar findings and provided optimization strategies. However, in our case the action of the operator is local to the element.…”

Section: Introduction (mentioning)

confidence: 62%

Acceleration of tensor-product operations for high-order finite element methods

Świrydowicz, Chalmers, Karakus, et al. 2019

The International Journal of High Performance Computing Applica


This paper is devoted to GPU kernel optimization and performance analysis of three tensor-product operators arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close-to-the-peak performance for these operators requires extensive optimization because of the operators' properties: low arithmetic intensity, tiered structure, and the need to store intermediate results inside the kernel. We give a guided overview of optimization strategies and we present a performance model that allows us to compare the efficacy of these optimizations against an empirically calibrated roofline.


“…Their implementation demonstrates up to 2× speedup in the best case, but hardly achieves any speedup on most of the matrices due to data format conversion overhead. A similar approach has been implemented by Grigoraş et al (2016) with a better speedup for FPGA architectures.…”

Section: Discussion Of Existing Studies (mentioning)

confidence: 99%

Performance impact of precision reduction in sparse linear systems solvers

Zounon, Higham, Lucas, et al. 2022

PeerJ Computer Science


It is well established that reduced precision arithmetic can be exploited to accelerate the solution of dense linear systems. Typical examples are mixed precision algorithms that reduce the execution time and the energy consumption of parallel solvers for dense linear systems by factorizing a matrix at a precision lower than the working precision. Much less is known about the efficiency of reduced precision in parallel solvers for sparse linear systems, and existing work focuses on single core experiments. We evaluate the benefits of using single precision arithmetic in solving a double precision sparse linear system using multiple cores. We consider both direct methods and iterative methods and we focus on using single precision for the key components of LU factorization and matrix–vector products. Our results show that the anticipated speedup of 2 over a double precision LU factorization is obtained only for the very largest of our test problems. We point out two key factors underlying the poor speedup. First, we find that single precision sparse LU factorization is prone to a severe loss of performance due to the intrusion of subnormal numbers. We identify a mechanism that allows cascading fill-ins to generate subnormal numbers and show that automatically flushing subnormals to zero avoids the performance penalties. The second factor is the lack of parallelism in the analysis and reordering phases of the solvers and the absence of floating-point arithmetic in these phases. For iterative solvers, we find that for the majority of the matrices computing or applying incomplete factorization preconditioners in single precision provides at best modest performance benefits compared with the use of double precision. We also find that using single precision for the matrix–vector product kernels provides an average speedup of 1.5 over double precision kernels. In both cases some form of refinement is needed to raise the single precision results to double precision accuracy, which will reduce performance gains.
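
The abstract's closing remark about refinement can be illustrated with a small sketch of classical iterative refinement. This is not the code used in the cited study: the helper names (`to_single`, `solve_single`, `refine`) are hypothetical, single precision is emulated by rounding through `struct`, the solver is a small dense Gaussian elimination rather than a sparse LU, and the matrix is refactorized at every step for brevity where a real solver would reuse the factors.

```python
import struct


def to_single(v):
    """Round a Python float (double) to the nearest IEEE 754 single value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def solve_single(A, b):
    """Dense Gaussian elimination with partial pivoting, rounding each
    intermediate result to single precision (emulated low-precision solve)."""
    n = len(b)
    # Augmented matrix [A | b], stored in single precision.
    M = [[to_single(a) for a in row] + [to_single(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = to_single(M[r][k] / M[k][k])
            for c in range(k, n + 1):
                M[r][c] = to_single(M[r][c] - to_single(f * M[k][c]))
    # Back substitution, also rounded to single.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = to_single(s - to_single(M[i][j] * x[j]))
        x[i] = to_single(s / M[i][i])
    return x


def residual(A, x, b):
    """Residual b - A x, computed in double precision."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, steps=5):
    """Iterative refinement: low-precision solves, double-precision residuals."""
    x = solve_single(A, b)
    for _ in range(steps):
        r = residual(A, x, b)
        d = solve_single(A, r)  # correction from the cheap solver
        x = [xi + di for xi, di in zip(x, d)]
    return x


A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_low = solve_single(A, b)
x_ref = refine(A, b)
print(max(abs(v) for v in residual(A, x_low, b)))
print(max(abs(v) for v in residual(A, x_ref, b)))
```

Each refinement step costs an extra residual computation and an extra low-precision solve, which is the overhead the abstract refers to when it says refinement reduces the performance gains.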


“…[50] proposed a new sparse matrix storage method called "BVCSR" to compress the indices of non-zero elements, thus increasing the valid bandwidth of FPGA. [51] and [52] proposed an architecture for large-scale SpMV in the FEM problem. [51] co-designed an FPGA SpMV architecture with a matrix stripping and partitioning algorithms that enable the architecture to process arbitrarily large matrices without changing the PE quantities.…”

Section: Related Work (mentioning)

confidence: 99%
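
For readers unfamiliar with the kernel this statement refers to, the following is a minimal sketch of SpMV on the standard CSR format, plus a column-striped variant in which each stripe touches only a slice of the input vector. It is a generic illustration under stated assumptions: it is not BVCSR, and it is not the partitioning algorithm or FPGA architecture of the cited paper. The function names (`csr_from_dense`, `spmv_csr`, `spmv_striped`) and the `stripe_width` parameter are hypothetical.

```python
def csr_from_dense(dense):
    """Build CSR arrays (values, col_idx, row_ptr) from a dense row-major matrix."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row's nonzeros
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A x, where A is given in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def spmv_striped(dense, x, stripe_width):
    """Compute y = A x by splitting A into vertical stripes of columns.

    Each stripe needs only x[start:stop], so the working set of the input
    vector is bounded by stripe_width regardless of the matrix size.
    """
    y = [0.0] * len(dense)
    for start in range(0, len(x), stripe_width):
        stop = start + stripe_width
        stripe = [row[start:stop] for row in dense]
        values, col_idx, row_ptr = csr_from_dense(stripe)
        partial = spmv_csr(values, col_idx, row_ptr, x[start:stop])
        y = [yi + pi for yi, pi in zip(y, partial)]  # accumulate partial results
    return y


A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
x = [1, 2, 3]
values, col_idx, row_ptr = csr_from_dense(A)
print(spmv_csr(values, col_idx, row_ptr, x))
print(spmv_striped(A, x, stripe_width=2))
```

The striped variant trades extra accumulation passes over the output for a fixed-size slice of the input vector, which is the general reason blocking schemes let a fixed amount of on-chip memory handle larger matrices.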

SpArch: Efficient Architecture for Sparse Matrix Multiplication

Zhang, Wang, Han, et al. 2020

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)

178


Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in various engineering and scientific applications. However, inner product based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while outer product based approach [1] suffers from poor output locality due to numerous partial product matrices. Inefficiency in the reuse of either inputs or outputs data leads to extensive and expensive DRAM access. To address this problem, this paper proposes an efficient sparse matrix multiplication accelerator architecture, SpArch, which jointly optimizes the data locality for both input and output matrices. We first design a highly parallelized streaming-based merger to pipeline the multiply and merge stage of partial matrices so that partial matrices are merged on chip immediately after produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4×. We further develop a Huffman tree scheduler to improve the scalability of the merger for larger sparse matrices, which reduces the DRAM access by another 1.8×. We also resolve the increased input matrix read induced by the new representation using a row prefetcher with near-optimal buffer replacement policy, further reducing the DRAM access by 1.5×. Evaluated on 20 benchmarks, SpArch reduces the total DRAM access by 2.8× over previous state-of-the-art. On average, SpArch achieves 4×, 19×, 18×, 17×, 1285× speedup and 6×, 164×, 435×, 307×, 62× energy savings over OuterSPACE, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.
