2010
DOI: 10.1007/978-3-642-11515-8_10

Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures

Cited by 197 publications (162 citation statements)
References 4 publications
“…Note that the definition of the first operator depends on the storage format used for the convective operator. An example for the sliced ELLPACK format is shown in [21].…”
Section: Governing Equations and Numerical Methods (mentioning)
confidence: 99%
“…It consists of sorting the rows by the number of entries and then dividing the matrix into slices, which are themselves stored using the ELLPACK format. More details can be found in [18,21]. A performance comparison between the CSR and sELL formats in our application context is presented in the next section.…”
Section: Intra-device Optimization (mentioning)
confidence: 99%
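To make the sliced ELLPACK construction described in the excerpt above concrete, here is a minimal Python sketch; it is not taken from any of the cited implementations, and the function name build_sliced_ellpack, the slice_size parameter, and the toy matrix are illustrative assumptions. It sorts rows by their number of entries, groups them into slices, and pads each slice only to the width of its own longest row.

```python
import numpy as np

def build_sliced_ellpack(rows, slice_size=4):
    """Sketch of a sliced ELLPACK (SELL) layout.

    Each row is a list of (column, value) pairs. Rows are sorted by their
    number of entries, partitioned into slices of `slice_size` rows, and
    each slice is padded only to the length of its own longest row.
    """
    # Sort row indices by descending number of nonzeros.
    order = sorted(range(len(rows)), key=lambda r: len(rows[r]), reverse=True)

    slices = []
    for start in range(0, len(order), slice_size):
        block = order[start:start + slice_size]
        width = max(len(rows[r]) for r in block)  # padding width for this slice only
        cols = np.zeros((len(block), width), dtype=np.int32)
        vals = np.zeros((len(block), width), dtype=np.float64)
        for i, r in enumerate(block):
            for j, (c, v) in enumerate(rows[r]):
                cols[i, j] = c
                vals[i, j] = v
        slices.append({"rows": block, "cols": cols, "vals": vals})
    return slices

# Toy matrix with rows of very different lengths.
rows = [
    [(0, 1.0)],
    [(0, 2.0), (1, 3.0), (2, 4.0), (3, 5.0)],
    [(1, 6.0), (2, 7.0)],
    [(3, 8.0)],
]
for s in build_sliced_ellpack(rows, slice_size=2):
    print(s["rows"], s["vals"].shape)
```

The permutation stored in each slice ("rows") is what allows the result of an SpMV over the sorted layout to be scattered back into the original row order.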
“…Monakov et al. [5] put forward a sliced ELL format and used auto-tuning to find the optimal configuration for better performance. Zheng and Gu [6] proposed the bisection ELL (BiELL) and bisection JAD (BiJAD) formats, based on the ELL and JAD formats, for optimizing SpMV on GPUs.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the resulting sliced ELLPACK format (SELL or SELL-C where C denotes the size of the row blocks [123,127]), the overhead is no longer determined by the matrix row containing the largest number of nonzeros, but by the row with the largest number of nonzero elements in the respective block.…”
Section: Graphics Accelerators (mentioning)
confidence: 99%
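As a small numeric illustration of that padding argument (the row lengths, the block size C = 4, and the helper padding_overhead are invented for this sketch, and rows are assumed already sorted by length), plain ELLPACK pads every row to the global maximum row length, while SELL-C pads only within each block of C rows:

```python
def padding_overhead(row_lengths, block_size):
    """Padded storage for SELL-C: each block of `block_size` rows is padded
    to its own longest row, instead of the global maximum (plain ELLPACK)."""
    total = 0
    for start in range(0, len(row_lengths), block_size):
        block = row_lengths[start:start + block_size]
        total += max(block) * len(block)
    return total

row_lengths = [1, 2, 2, 3, 3, 4, 50, 50]      # one pair of long rows
ell = max(row_lengths) * len(row_lengths)      # plain ELLPACK: pad to global max
sell = padding_overhead(row_lengths, block_size=4)
print(ell, sell)  # 400 padded entries (ELLPACK) vs. 212 (SELL-C with C = 4)
```

With these made-up row lengths the long rows inflate only their own block, which is exactly the effect the excerpt describes.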
“…, which incorporates optimized versions of CG, mainly focused on the SpMV operation on the same architectures. Specifically, the CSR, BCSR, and CSB formats [192,193,194,195,196] were used for the multicore architectures, while the ELLPACK, ELLR_T, and SELL-P formats [197,198,199,200,201] were used for the GPUs, and "CUDA kernel fusion" was also included. In addition, the study used DP arithmetic, although, as a final complement, the use of SP was also tested for the GPU (Kepler) and a general-purpose processor (Intel Bridge).…”
Section: Analysis of Parallel Architectures (unclassified)