Abstract: We analyze the problem of sparse-matrix dense-vector multiplication (SpMV) in the I/O model. In SpMV, the objective is to compute y = Ax, where A is a sparse matrix and x and y are vectors. We give tight upper and lower bounds on the number of block transfers as a function of the sparsity k, the number of nonzeros in a column of A. The parameter k is a knob that bridges the problems of permuting (k = 1) and dense matrix multiplication (k = N). When the nonzero elements of A are stored in column-major order, …
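For concreteness, here is a minimal sequential sketch of the computation the abstract describes, assuming a compressed-sparse-column (CSC) layout as one natural realization of "column-major order"; all identifiers are our own illustrative choices, not the paper's:

```c
#include <stddef.h>

/* Sketch: y = A*x with A in compressed-sparse-column (CSC) form, i.e. the
 * nonzeros stored in column-major order. colptr[j]..colptr[j+1]-1 indexes
 * the nonzeros of column j; with sparsity parameter k, each column holds
 * k nonzeros. y must be zero-initialized and have one entry per row. */
void spmv_csc(size_t n_cols, const size_t *colptr, const size_t *rowidx,
              const double *val, const double *x, double *y)
{
    for (size_t j = 0; j < n_cols; j++)                  /* one column at a time */
        for (size_t p = colptr[j]; p < colptr[j + 1]; p++)
            y[rowidx[p]] += val[p] * x[j];               /* scatter into y */
}
```

The scattered writes to y[rowidx[p]] are what make the block-transfer cost nontrivial: with k = 1 the loop degenerates into permuting (scaled) entries of x into y, and with k = N every column touches every row, the dense extreme of the abstract's "knob".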
“…Then I(A, x) = 1, 2, 3, 4, 9, 10, 11, 12, 5, 6, 7, 8, 13, 14, 15, 16 is a run with the stripes (1, 2, 3, 4), (9, 10, 11, 12), (5, 6, 7, 8), (13, 14, 15, 16).…”
Section: Definition 2 (Runs): A Sequence of Memory Locations Is Called…
“…In the first phase we read the contents of run (1, 2, 3, 4) and write it into run (5, 6, 7, 8). In the second phase the contents of run (9, 10, 11, 12) is written into run (13, 14, 15, 16). Each run here consists of a single stripe and we read from exactly one run at each phase.…”
Section: (And the Runs Are Defined at the Beginning of a Phase); And…
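As a toy rendering of the quoted example (our own construction, with 1-based cells to match the text), the following performs the two phases on memory locations 1 through 16:

```c
#include <stdio.h>
#include <string.h>

/* Memory locations 1..16; each run consists of a single stripe of four
 * consecutive locations. Phase 1 copies run (1,2,3,4) into run (5,6,7,8);
 * phase 2 copies run (9,10,11,12) into run (13,14,15,16). Cell 0 is unused
 * so that indices match the 1-based text. */
int main(void)
{
    int mem[17];
    for (int i = 1; i <= 16; i++) mem[i] = i;        /* initial contents */

    memcpy(&mem[5],  &mem[1], 4 * sizeof mem[0]);    /* phase 1 */
    memcpy(&mem[13], &mem[9], 4 * sizeof mem[0]);    /* phase 2 */

    for (int i = 1; i <= 16; i++) printf("%d ", mem[i]);
    printf("\n");                      /* prints: 1 2 3 4 1 2 3 4 9 10 ... */
    return 0;
}
```

Note how each phase reads from exactly one run, as the quoted passage requires.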
Energy consumption has emerged as a first-class computing resource for both server systems and personal computing devices. The growing importance of energy has led to a rethinking of hardware design, hypervisors, operating systems, and compilers. Algorithm design, however, is still relatively untouched by the importance of energy, and algorithmic complexity models do not capture the energy consumed by an algorithm. In this paper, we propose a new complexity model to account for the energy used by an algorithm. Based on an abstract memory model (which was inspired by the popular DDR3 memory model and is similar to the parallel disk I/O model of Vitter and Shriver), we present a simple energy model that is a (weighted) sum of the time complexity of the algorithm and the number of "parallel" I/O accesses made by the algorithm. We derive this simple model from a more complicated model that better matches the ground truth, and we present some experimental justification for our model. We believe that the simplicity (and applicability) of this energy model is the main contribution of the paper. We present some sufficient conditions on algorithm behavior that allow us to bound the energy complexity of the algorithm in terms of its time complexity (in the RAM model) and its I/O complexity (in the I/O model). As corollaries, we obtain energy-optimal algorithms for sorting (and its special cases, such as permutation), matrix transposition, and (sparse) matrix-vector multiplication.
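Written out, the abstract's "weighted sum" takes roughly the following shape; the weight symbols c_T and c_I are ours, introduced only to make the form of the model concrete:

```latex
% Sketch of the proposed energy model (our notation, not the paper's):
%   T(A)      = time complexity of algorithm A in the RAM model,
%   Q(A)      = number of "parallel" I/O accesses A makes in the abstract
%               (DDR3-inspired, Vitter--Shriver-like) memory model,
%   c_T, c_I  = hardware-dependent nonnegative weights.
E(A) \;=\; c_T \cdot T(A) \;+\; c_I \cdot Q(A)
```

Under a model of this shape, an algorithm whose RAM running time and parallel I/O count are both optimal is automatically energy-optimal, which appears to be the flavor of the sufficient conditions the abstract alludes to.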
“…The second algorithmic direction strives to achieve optimal theoretical I/O complexity by using cache-oblivious algorithms [3]. From a high-level view, Bender's algorithm first generates all the intermediate triples of the output vector y, possibly with repeating indices.…”
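A simplified, non-I/O-efficient rendering of that first phase, assuming coordinate-format (row, column, value) storage; the names are our own, we emit pairs (row, partial product) as a slight simplification of the quote's triples, and the subsequent I/O-efficient combining of repeated row indices (e.g. by sorting) is deliberately left out:

```c
#include <stddef.h>

/* From each stored nonzero a_ij, emit one intermediate pair for the output
 * vector y: its destination row i and the partial product a_ij * x_j.
 * Row indices may repeat; a later combining pass reduces them into y. */
typedef struct { size_t row; double prod; } pair_t;

size_t emit_pairs(size_t nnz, const size_t *rowidx, const size_t *colidx,
                  const double *val, const double *x, pair_t *out)
{
    for (size_t p = 0; p < nnz; p++) {
        out[p].row  = rowidx[p];
        out[p].prod = val[p] * x[colidx[p]];   /* a_ij * x_j */
    }
    return nnz;                                /* pairs produced */
}
```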
Abstract: On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth-limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing substantial parallelism in practice. We also give a new data-structure transformation, called bitmasked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5 in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratios and larger sparse matrices) continue.
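The bandwidth-halving idea for the symmetric case can be sketched sequentially: store only the lower triangle and use each off-diagonal nonzero twice per read. The kernel below is our own illustration in CSR-style arrays; the paper's actual contribution, which this sketch omits, is making the mirrored updates scale across threads:

```c
#include <stddef.h>

/* y = A*x for symmetric A, storing only the lower triangle (diagonal
 * included). Each off-diagonal nonzero a_ij is read from memory once but
 * applied twice, updating both y[i] and y[j]; this is the source of the
 * (up to) 2x bandwidth saving. y must be zero-initialized. */
void spmv_sym_lower(size_t n, const size_t *rowptr, const size_t *colidx,
                    const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t p = rowptr[i]; p < rowptr[i + 1]; p++) {
            size_t j = colidx[p];       /* j <= i in the lower triangle */
            double a = val[p];
            y[i] += a * x[j];
            if (j != i)
                y[j] += a * x[i];       /* mirrored update for a_ji */
        }
    }
}
```

The mirrored y[j] update is also exactly what creates write conflicts between threads, which is why the parallel version is the hard part.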
“…Bender et al. [5] extended the sequential communication lower bounds introduced in [14] to sparse matrix vector multiplication. This lower bound is relevant to our analysis of Krylov subspace methods, which essentially perform repeated sparse matrix vector multiplications.…”
Section: Previous Work
“…This lower bound is relevant to our analysis of Krylov subspace methods, which essentially perform repeated sparse matrix vector multiplications. However, [5] used a sequential memory hierarchy model and established bounds in terms of memory size and track (cacheline) size, while we focus on interprocessor communication.…”
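In the simplest rendering, those repeated multiplications are a loop over an SpMV kernel such as the CSC sketch given earlier; the basis length s, the buffer layout, and the assumption of a square matrix are ours:

```c
#include <stddef.h>
#include <string.h>

/* CSC SpMV kernel as sketched earlier in this section. */
void spmv_csc(size_t n_cols, const size_t *colptr, const size_t *rowidx,
              const double *val, const double *x, double *y);

/* Krylov-style kernel: compute x, Ax, A^2 x, ..., A^s x by repeated SpMV
 * on an n x n matrix. v holds s+1 rows of length n; row 0 is the start
 * vector. Each iteration re-reads A, which is why SpMV lower bounds carry
 * over to Krylov subspace methods. */
void krylov_basis(size_t n, size_t s, const size_t *colptr,
                  const size_t *rowidx, const double *val, double *v)
{
    for (size_t t = 0; t < s; t++) {
        memset(&v[(t + 1) * n], 0, n * sizeof(double));   /* zero next row */
        spmv_csc(n, colptr, rowidx, val, &v[t * n], &v[(t + 1) * n]);
    }
}
```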
This paper derives tradeoffs between three basic costs of a parallel algorithm: synchronization, data movement, and computational cost. Our theoretical model counts the amount of work and data movement as the maximum over any execution path during the parallel computation. By considering this metric, rather than the total communication volume over the whole machine, we obtain new insight into the characteristics of parallel schedules for algorithms with non-trivial dependency structures. The tradeoffs we derive are lower bounds on the execution time of the algorithm which are independent of the number of processors but dependent on the problem size. Therefore, these tradeoffs provide lower bounds on the parallel execution time of any algorithm computed by a system composed of any number of homogeneous components, each with associated computational, communication, and synchronization payloads. We first state our results for general graphs, based on expansion parameters; then we apply the theorem to a number of specific algorithms in numerical linear algebra, namely triangular substitution, Gaussian elimination, and Krylov subspace methods. Our lower bound for LU factorization demonstrates the optimality of Tiskin's LU algorithm [24], answering an open question posed in his paper, as well as of the 2.5D LU algorithm [20], which has analogous costs. We treat the computations in a general manner by noting that they share a similar dependency hypergraph structure and by analyzing the communication requirements of lattice hypergraph structures.