Abstract: We analyze the problem of sparse-matrix dense-vector multiplication (SpMV) in the I/O model. In SpMV, the objective is to compute y = Ax, where A is a sparse matrix and x and y are vectors. We give tight upper and lower bounds on the number of block transfers as a function of the sparsity k, the number of nonzeros in a column of A. The parameter k is a knob that bridges the problems of permuting (k = 1) and dense matrix multiplication (k = N). When the nonzero elements of A are stored in column-major order, …
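For concreteness, here is a minimal sequential sketch of the computation the abstract describes, assuming a compressed-sparse-column (CSC) layout as one natural realization of "column-major order"; all identifiers are our own illustrative choices, not the paper's:

```c
#include <stddef.h>

/* Sketch: y = A*x with A in compressed-sparse-column (CSC) form, i.e. the
 * nonzeros stored in column-major order. colptr[j]..colptr[j+1]-1 indexes
 * the nonzeros of column j; with sparsity parameter k, each column holds
 * k nonzeros. y must be zero-initialized and have one entry per row. */
void spmv_csc(size_t n_cols, const size_t *colptr, const size_t *rowidx,
              const double *val, const double *x, double *y)
{
    for (size_t j = 0; j < n_cols; j++)                  /* one column at a time */
        for (size_t p = colptr[j]; p < colptr[j + 1]; p++)
            y[rowidx[p]] += val[p] * x[j];               /* scatter into y */
}
```

The scattered writes to y[rowidx[p]] are what make the block-transfer cost nontrivial: with k = 1 the loop degenerates into permuting (scaled) entries of x into y, and with k = N every column touches every row, the dense extreme of the abstract's "knob".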
“…Then I(A, x) = 1, 2, 3, 4, 9, 10, 11, 12, 5, 6, 7, 8, 13, 14, 15, 16 is a run with the stripes (1, 2, 3, 4), (9, 10, 11, 12), (5, 6, 7, 8), (13, 14, 15, 16).…”
Section: Definition 2 (Runs): A Sequence of Memory Locations Is Called…
“…In the first phase we read the contents of run (1, 2, 3, 4) and write it into run (5, 6, 7, 8). In the second phase the contents of run (9, 10, 11, 12) is written into run (13, 14, 15, 16). Each run here consists of a single stripe and we read from exactly one run at each phase.…”
Section: (And the Runs Are Defined at the Beginning of a Phase); And…
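As a toy rendering of the quoted example (our own construction, with 1-based cells to match the text), the following performs the two phases on memory locations 1 through 16:

```c
#include <stdio.h>
#include <string.h>

/* Memory locations 1..16; each run consists of a single stripe of four
 * consecutive locations. Phase 1 copies run (1,2,3,4) into run (5,6,7,8);
 * phase 2 copies run (9,10,11,12) into run (13,14,15,16). Cell 0 is unused
 * so that indices match the 1-based text. */
int main(void)
{
    int mem[17];
    for (int i = 1; i <= 16; i++) mem[i] = i;        /* initial contents */

    memcpy(&mem[5],  &mem[1], 4 * sizeof mem[0]);    /* phase 1 */
    memcpy(&mem[13], &mem[9], 4 * sizeof mem[0]);    /* phase 2 */

    for (int i = 1; i <= 16; i++) printf("%d ", mem[i]);
    printf("\n");                      /* prints: 1 2 3 4 1 2 3 4 9 10 ... */
    return 0;
}
```

Note how each phase reads from exactly one run, as the quoted passage requires.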
Energy consumption has emerged as a first-class computing resource for both server systems and personal computing devices. The growing importance of energy has led to a rethinking of hardware design, hypervisors, operating systems, and compilers. Algorithm design, however, is still relatively untouched by the importance of energy, and algorithmic complexity models do not capture the energy consumed by an algorithm. In this paper, we propose a new complexity model to account for the energy used by an algorithm. Based on an abstract memory model (which was inspired by the popular DDR3 memory model and is similar to the parallel disk I/O model of Vitter and Shriver), we present a simple energy model that is a (weighted) sum of the time complexity of the algorithm and the number of "parallel" I/O accesses made by the algorithm. We derive this simple model from a more complicated model that better matches the ground truth, and we present some experimental justification for our model. We believe that the simplicity (and applicability) of this energy model is the main contribution of the paper. We present some sufficient conditions on algorithm behavior that allow us to bound the energy complexity of the algorithm in terms of its time complexity (in the RAM model) and its I/O complexity (in the I/O model). As corollaries, we obtain energy-optimal algorithms for sorting (and its special cases, such as permutation), matrix transposition, and (sparse) matrix-vector multiplication.
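Written out, the abstract's "weighted sum" takes roughly the following shape; the weight symbols c_T and c_I are ours, introduced only to make the form of the model concrete:

```latex
% Sketch of the proposed energy model (our notation, not the paper's):
%   T(A)      = time complexity of algorithm A in the RAM model,
%   Q(A)      = number of "parallel" I/O accesses A makes in the abstract
%               (DDR3-inspired, Vitter--Shriver-like) memory model,
%   c_T, c_I  = hardware-dependent nonnegative weights.
E(A) \;=\; c_T \cdot T(A) \;+\; c_I \cdot Q(A)
```

Under a model of this shape, an algorithm whose RAM running time and parallel I/O count are both optimal is automatically energy-optimal, which appears to be the flavor of the sufficient conditions the abstract alludes to.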
“…The second algorithmic direction strives to achieve optimal theoretical I/O complexity by using cache-oblivious algorithms [3]. From a high-level view, Bender's algorithm first generates all the intermediate triples of the output vector y, possibly with repeating indices.…”
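A simplified, non-I/O-efficient rendering of that first phase, assuming coordinate-format (row, column, value) storage; the names are our own, we emit pairs (row, partial product) as a slight simplification of the quote's triples, and the subsequent I/O-efficient combining of repeated row indices (e.g. by sorting) is deliberately left out:

```c
#include <stddef.h>

/* From each stored nonzero a_ij, emit one intermediate pair for the output
 * vector y: its destination row i and the partial product a_ij * x_j.
 * Row indices may repeat; a later combining pass reduces them into y. */
typedef struct { size_t row; double prod; } pair_t;

size_t emit_pairs(size_t nnz, const size_t *rowidx, const size_t *colidx,
                  const double *val, const double *x, pair_t *out)
{
    for (size_t p = 0; p < nnz; p++) {
        out[p].row  = rowidx[p];
        out[p].prod = val[p] * x[colidx[p]];   /* a_ij * x_j */
    }
    return nnz;                                /* pairs produced */
}
```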
Abstract: On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth-limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing substantial parallelism in practice. We also give a new data-structure transformation, called bitmasked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5 in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratios and larger sparse matrices) continue.
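The bandwidth-halving idea for the symmetric case can be sketched sequentially: store only the lower triangle and use each off-diagonal nonzero twice per read. The kernel below is our own illustration in CSR-style arrays; the paper's actual contribution, which this sketch omits, is making the mirrored updates scale across threads:

```c
#include <stddef.h>

/* y = A*x for symmetric A, storing only the lower triangle (diagonal
 * included). Each off-diagonal nonzero a_ij is read from memory once but
 * applied twice, updating both y[i] and y[j]; this is the source of the
 * (up to) 2x bandwidth saving. y must be zero-initialized. */
void spmv_sym_lower(size_t n, const size_t *rowptr, const size_t *colidx,
                    const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t p = rowptr[i]; p < rowptr[i + 1]; p++) {
            size_t j = colidx[p];       /* j <= i in the lower triangle */
            double a = val[p];
            y[i] += a * x[j];
            if (j != i)
                y[j] += a * x[i];       /* mirrored update for a_ji */
        }
    }
}
```

The mirrored y[j] update is also exactly what creates write conflicts between threads, which is why the parallel version is the hard part.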
“…Bender et al. [5] extended the sequential communication lower bounds introduced in [14] to sparse matrix vector multiplication. This lower bound is relevant to our analysis of Krylov subspace methods, which essentially perform repeated sparse matrix vector multiplications.…”
Section: Previous Work
“…This lower bound is relevant to our analysis of Krylov subspace methods, which essentially perform repeated sparse matrix vector multiplications. However, [5] used a sequential memory hierarchy model and established bounds in terms of memory size and track (cacheline) size, while we focus on interprocessor communication.…”
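In the simplest rendering, those repeated multiplications are a loop over an SpMV kernel such as the CSC sketch given earlier; the basis length s, the buffer layout, and the assumption of a square matrix are ours:

```c
#include <stddef.h>
#include <string.h>

/* CSC SpMV kernel as sketched earlier in this section. */
void spmv_csc(size_t n_cols, const size_t *colptr, const size_t *rowidx,
              const double *val, const double *x, double *y);

/* Krylov-style kernel: compute x, Ax, A^2 x, ..., A^s x by repeated SpMV
 * on an n x n matrix. v holds s+1 rows of length n; row 0 is the start
 * vector. Each iteration re-reads A, which is why SpMV lower bounds carry
 * over to Krylov subspace methods. */
void krylov_basis(size_t n, size_t s, const size_t *colptr,
                  const size_t *rowidx, const double *val, double *v)
{
    for (size_t t = 0; t < s; t++) {
        memset(&v[(t + 1) * n], 0, n * sizeof(double));   /* zero next row */
        spmv_csc(n, colptr, rowidx, val, &v[t * n], &v[(t + 1) * n]);
    }
}
```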
This paper derives tradeoffs between three basic costs of a parallel algorithm: synchronization, data movement, and computational cost. Our theoretical model counts the amount of work and data movement as the maximum over any execution path during the parallel computation. By considering this metric, rather than the total communication volume over the whole machine, we obtain new insight into the characteristics of parallel schedules for algorithms with non-trivial dependency structures. The tradeoffs we derive are lower bounds on the execution time of the algorithm which are independent of the number of processors but dependent on the problem size. Therefore, these tradeoffs provide lower bounds on the parallel execution time of any algorithm computed by a system composed of any number of homogeneous components, each with associated computational, communication, and synchronization payloads. We first state our results for general graphs, based on expansion parameters; then we apply the theorem to a number of specific algorithms in numerical linear algebra, namely triangular substitution, Gaussian elimination, and Krylov subspace methods. Our lower bound for LU factorization demonstrates the optimality of Tiskin's LU algorithm [24], answering an open question posed in his paper, as well as of the 2.5D LU algorithm [20], which has analogous costs. We treat the computations in a general manner by noting that they share a similar dependency hypergraph structure and by analyzing the communication requirements of lattice hypergraph structures.