2022
DOI: 10.1016/j.parco.2021.102871

Using long vector extensions for MPI reductions

Cited by 9 publications (3 citation statements)
References 16 publications
“…Quicksort and Bitonic sort have also been optimized for the AVX-512 ISA [12], while stencil computations have been optimized for ARM SVE by Armejach et al [7]. Zhong et al [47] investigate the use of vector intrinsics in the context of MPI reductions, for both the AVX-512 and SVE ISAs.…”
Section: Related Work
confidence: 99%
“…First, data transfer between the host and the device for small copies in heterogeneous nodes is mainly performed by direct assignment. On x86 CPUs, the bandwidth of the point-to-point data copy can be increased, and the number of copies within the node reduced, by using the high-bandwidth copy instructions available through multi-vector access instruction support [14]; this reduces memory contention during large-volume communication, lowers the cache-miss rate, and improves the efficiency of the point-to-point communication functions in the heterogeneous runtime system. As shown in Figure 4, we propose to choose whether to use multi-vector access instructions to optimize copies between heterogeneous devices based on the heterogeneous node architecture.…”
Section: Multi-vector Copy Instruction Optimization
confidence: 99%
“…one single CPU instruction can simultaneously process from two doubles (four floats) up to eight doubles (or sixteen floats), depending on the underlying architecture [49]. Vectorization can be easily applied to a parallel framework [50], which in our case results in the scheme shown in Figure 2.…”
Section: Matrix-free and Matrix-based Solvers
confidence: 99%