An Extended Compression Format for the Optimization of Sparse Matrix-Vector Multiplication

Karakasis, Vasileios; Gkountouvas, Theodoros; Kourtis, Kornilios; Goumas, Georgios; Koziris, Nectarios

doi:10.1109/tpds.2012.290

Cited by 42 publications

(36 citation statements)

References 27 publications

(54 reference statements)

“…A delta unit in CSX is a sequence of column indices that can be represented by a specified number of bits (namely, 8, 16 or 32 bits). A detailed description and performance evaluation of CSX can be found in [16].…”

Section: Extending CSX to Symmetric Matrices, A. Overview of the C… (mentioning)

confidence: 99%
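The delta-unit idea in the statement above can be illustrated with a small sketch. It is not the CSX encoder itself. The function names `delta_width` and `delta_units` are hypothetical, and the sketch assumes that the first delta of a row is taken relative to column 0 and that consecutive deltas needing the same width are grouped into one unit.

```python
from itertools import groupby


def delta_width(delta):
    """Return the smallest of 8, 16 or 32 bits that can hold a delta."""
    if delta < 0:
        raise ValueError("column indices must be sorted in ascending order")
    for bits in (8, 16, 32):
        if delta < (1 << bits):
            return bits
    raise ValueError("delta does not fit in 32 bits")


def delta_units(col_indices):
    """Split one row's sorted column indices into (bits, deltas) units.

    Assumption for illustration: the first delta is the first column index
    itself (distance from column 0), and a new unit starts whenever the
    required bit width changes.
    """
    if not col_indices:
        return []
    deltas = [col_indices[0]]
    deltas += [b - a for a, b in zip(col_indices, col_indices[1:])]
    return [(bits, list(group)) for bits, group in groupby(deltas, key=delta_width)]


row = [3, 5, 6, 300, 301, 70000]
units = delta_units(row)
for bits, deltas in units:
    print(f"{bits:2d}-bit unit: {deltas}")
```

Nearby column indices produce small deltas that fit in 8 bits, which is where the index compression comes from; a distant element forces a wider unit.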

“…For the parallelization of the SpM×V routines and the preprocessing phase of CSX, we used explicit, native threading with the Pthreads library (NPTL 2.7) and bound the threads to specific logical processors using the Linux kernel's system call interface. Finally, for the NUMA-aware implementations, we used the numactl library, version 2.0.7, in conjunction with our low-level interleaved allocator [16].…”

Section: Experimental Evaluation, A. Experimental Setup (mentioning)

confidence: 99%

“…However, thanks to a careful implementation and the use of advanced matrix sampling techniques, the cost of CSX preprocessing is rather contained [16]. Indeed, the preprocessing cost in Dunnington and Gainestown, using 24 and 16 threads, respectively, amounts to 49 and 94 serial SpM×V operations in CSR format on average, while these numbers are slightly increased to 59 and 115 operations for the set of reordered matrices, a rather expected increase, since the serial SpM×V execution is considerably reduced in this case.…”

Section: E. Preprocessing Cost of CSX-Sym (mentioning)

confidence: 99%

“…Indeed, the preprocessing cost in Dunnington and Gainestown, using 24 and 16 threads, respectively, amounts to 49 and 94 serial SpM×V operations in CSR format on average, while these numbers are slightly increased to 59 and 115 operations for the set of reordered matrices, a rather expected increase, since the serial SpM×V execution is considerably reduced in this case. The higher numbers for Gainestown are due to the more elaborate preprocessing in NUMA machines needed for balancing the compression benefit and the decompression overhead [16].…”

Section: E. Preprocessing Cost of CSX-Sym (mentioning)

confidence: 99%

Improving the Performance of the Symmetric Sparse Matrix-Vector Multiplication in Multicore

Gkountouvas, Karakasis, Kourtis, et al. 2013

2013 IEEE 27th International Symposium on Parallel and Distributed Processing

Self Cite

Symmetric sparse matrices arise often in the solution of sparse linear systems. Exploiting the non-zero element symmetry in order to reduce the overall matrix size is very tempting for optimizing the symmetric Sparse Matrix-Vector Multiplication kernel (SpM×V) for multicore architectures. Despite being very beneficial for the single-threaded execution, not storing the upper or lower triangular part of a symmetric sparse matrix complicates the multithreaded SpM×V version, since it introduces an undesirable dependency on the output vector elements. The most common approach for overcoming this problem is to use local, per-thread vectors, which are reduced to the output vector at the end of the computation. However, this reduction leads to considerable memory traffic, limiting the scalability of the symmetric SpM×V. In this paper, we take a two-step approach in optimizing the symmetric SpM×V kernel. First, we introduce the CSX-Sym variant of the highly compressed CSX format, which exploits the non-zero element symmetry to further compress the input matrix. Second, we minimize the memory traffic produced by the local vectors reduction phase by implementing a non-zero indexing compression scheme that minimizes the local data to be reduced. Our indexing scheme allowed the scaling of symmetric SpM×V and provided a more than 2× performance improvement over the baseline CSR implementation and 83.9% over the typical symmetric SpM×V kernel. The CSX-Sym variant has further increased the symmetric SpM×V performance by 43.4%. Finally, we evaluate the effect of our optimizations in the context of the CG iterative method, where we achieve a 77.8% acceleration of the overall solver.
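The baseline approach named in the abstract, local per-thread vectors reduced into the output vector, can be sketched as follows. This is a minimal, sequential simulation under stated assumptions, not the authors' implementation: the matrix is assumed to be stored as the lower triangle (diagonal included) in plain CSR, rows are split into contiguous chunks, and the "threads" are ordinary loop iterations. The function name and the `num_threads` parameter are hypothetical.

```python
def symmetric_spmv_local_vectors(row_ptr, col_idx, values, x, num_threads=2):
    """Compute y = A*x for symmetric A stored as its lower triangle in CSR.

    Each simulated thread owns a contiguous chunk of rows and writes into
    its own full-length local vector, so the mirrored update y[j] += a*x[i]
    never touches memory owned by another thread.
    """
    n = len(row_ptr) - 1
    chunk = (n + num_threads - 1) // num_threads
    local = [[0.0] * n for _ in range(num_threads)]

    for t in range(num_threads):  # would run concurrently in a real kernel
        y_t = local[t]
        for i in range(t * chunk, min((t + 1) * chunk, n)):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j, a = col_idx[k], values[k]
                y_t[i] += a * x[j]        # stored element A[i][j]
                if j != i:
                    y_t[j] += a * x[i]    # mirrored element A[j][i]

    # Reduction phase: reads num_threads * n values in total, which is the
    # memory traffic the abstract identifies as the scalability limit.
    return [sum(local[t][i] for t in range(num_threads)) for i in range(n)]


# Lower triangle of A = [[4, 1, 0], [1, 3, 2], [0, 2, 5]]
row_ptr = [0, 1, 3, 5]
col_idx = [0, 0, 1, 1, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

y = symmetric_spmv_local_vectors(row_ptr, col_idx, values, x)
print(y)
```

In this naive form every local vector is full length and is reduced in full, even though each thread only writes to its own rows and to the earlier columns it mirrors into. Restricting the reduction to the entries that were actually written is the kind of saving the abstract's indexing scheme targets.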

“…The increases in the density and speed of field-programmable gate arrays (FPGAs) [1] make them attractive as flexible and high-speed alternatives to DSPs [3] and ASICs. It is a highly procedure oriented computation [6], there is only one way to multiply two matrices and it involves lots of multiplications and additions. But the simple part of matrix multiplication is that the evaluation of elements of the resultant elements can be done independent of the other, this point to distributed memory approach.…”

Section: Introduction (mentioning)

confidence: 99%

FPGA Implementation of Latency, Computational time Improvements in Matrix Multiplication

Jain, Kumar, Singh, et al. 2014

IJCA

Matrix operations, like matrix multiplication, are commonly used in almost all areas of scientific research. Matrix multiplication has significant application in the areas of graph theory, numerical algorithms, signal processing, and digital control. Matrix multiplication is a computationally intensive problem, and its design and efficient implementation on an FPGA, where resources are very limited, is especially demanding. FPGA-based designs are usually evaluated using three performance metrics: speed (latency), area, and power (energy). Fixed-point implementations in FPGA are fast and have minimal power consumption. With today's applications requiring ever higher computational throughput, a distributed memory approach is an effective solution for real-time applications. This application shows how to achieve higher computational throughput via parallel processing with the DSP processors. The matrix-vector multiplication is applied to calculate linear convolution. This paper presents an FPGA-based hardware realization of matrix multiplication based on a distributed memory architecture. We propose an architecture that is capable of handling matrices of variable sizes. Our designs minimize the gate count and area, improve latency, computational time, and throughput for matrix multiplication, and reduce the multiplication and addition hardware required to multiply the matrices on commercially available FPGA devices.

A novel multi–graphics processing unit parallel optimization framework for the sparse matrix‐vector multiplication

Gao, Wang 2016

Concurrency and Computation

Summary: The sparse matrix‐vector multiplication (SpMV) is of great importance in scientific computations. Graphics processing unit (GPU)‐accelerated SpMVs for large‐sized problems have attracted considerable attention recently. We observe that on a specific multi‐GPU platform, the SpMV performance can usually be greatly improved when a matrix is partitioned into several blocks according to a predetermined rule and each block is assigned to a GPU with an appropriate storage format. This motivates us to propose a novel multi‐GPU parallel SpMV optimization framework, which involves the following parts: (1) a simple rule is defined to divide any given matrix among multiple GPUs; (2) a performance model, which is independent of the problems and dependent on the resources of devices, is proposed to accurately predict the execution time of SpMV kernels; and (3) a selection algorithm is suggested to automatically select the most appropriate one from the storage formats that are involved in the framework for the matrix block that is assigned to each GPU on the basis of the performance model. The objective of our framework is not to construct a new storage format or algorithm but to automatically and rapidly generate an optimally parallel SpMV for any sparse matrix on a specific multi‐GPU platform by integrating the existing storage formats and their corresponding kernels. We take 5 popular storage formats as an example to present the idea of constructing the framework. Theoretically, we validate the correctness of our proposed SpMV performance model. This model is constructed only once for each type of GPU. Moreover, this framework is general and easy to extend. For a storage format that is not included in our framework, once the performance model of its corresponding SpMV kernel is successfully constructed, it can be incorporated into our framework. The experiments validate the efficiency of our proposed framework.
