Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2015
DOI: 10.1145/2688500.2688513

A framework for practical parallel fast matrix multiplication

Abstract: Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also on the shape. We develop a code generation tool to automatically implement multiple sequential and share…

Cited by 27 publications (25 citation statements)
References 33 publications
“…In our experiments, we use the effective Gfops (giga field operations per second) metric, also used in [12,25,2], defined as Gfops = (# of field operations using the classical matrix product) / time.…”
Section: Methodology of Experiments
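The effective Gfops metric quoted above charges every algorithm the classical operation count, so a fast algorithm that finishes sooner reports a higher rate. Under the standard assumption that the classical product of an m×k and a k×n matrix costs 2mkn field operations, the metric can be sketched as:

```python
def effective_gfops(m, k, n, seconds):
    """Effective Gflops: field operations of the *classical* product
    divided by measured time, regardless of the algorithm actually run."""
    classical_ops = 2 * m * k * n  # m*k*n multiplications + ~m*k*n additions
    return classical_ops / seconds / 1e9

# Example: a 1000x1000x1000 product completed in 1 second
# rates as 2 * 1000**3 / 1e9 = 2.0 effective Gflops.
```

Because the numerator is fixed, comparing effective Gfops across classical and sub-cubic algorithms directly compares wall-clock time on the same problem.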
“…In particular, numerical linear algebra based on Strassen's algorithm (if numerical stability issues are considered acceptable) should clearly benefit from most of its results. Related work on the parallelization of sub-cubic numerical linear algebra includes [1,24,6,25,2].…”
Section: Introduction
“…Both of the¹ […] ¹ In this paper, we distinguish the sorting network and the merging network. ² We use the Integer datatype as the representative of the one-word type, and the Double datatype for the two-word type.…”
Section: Intel MIC Vector Architecture
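The one-word versus two-word distinction in the snippet above matters for SIMD throughput: Intel MIC registers are 512 bits wide, so they hold twice as many one-word elements as two-word ones. A minimal sketch of that lane arithmetic (the constant and helper name are illustrative, not from the cited paper):

```python
VECTOR_BITS = 512  # Intel MIC (Xeon Phi) SIMD register width

def lanes(element_bits):
    """Number of elements of the given width that fit in one register."""
    return VECTOR_BITS // element_bits

# One-word 32-bit Integer: 16 lanes per register.
# Two-word 64-bit Double:   8 lanes per register.
```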
“…Mint and Physis [26,16] can generate effective GPU code for stencil computations. Benson et al. [2] provide a code generation tool to automatically implement various matrix multiplication algorithms. To facilitate the utilization of intra-core resources, Huo et al. [12] present a system with runtime SIMD parallelization via overridden operators and functions.…”
Section: Related Work
“…From a practical perspective, it is unlikely that the techniques for obtaining the best upper bounds on the exponent can be translated to practical algorithms that will execute faster than the classical one for reasonably sized matrices. In this paper, we are interested in the numerical stability of practical algorithms that have been demonstrated to outperform the classical algorithm (as well as Strassen's in some instances) on modern hardware [3].…”
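As a concrete illustration of the practical fast algorithms discussed above, one level of Strassen's recursion trades the classical eight half-size products for seven, at the cost of extra additions (and the mild numerical-stability concerns the snippet mentions). A minimal sketch, assuming square matrices of even dimension and using NumPy for the base-case products:

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen's algorithm: 7 half-size products
    instead of the classical 8. Assumes A, B are n x n with n even."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7  # C11
    C[:n, n:] = M3 + M5            # C12
    C[n:, :n] = M2 + M4            # C21
    C[n:, n:] = M1 - M2 + M3 + M6  # C22
    return C
```

Applying the recursion to the seven sub-products (rather than stopping after one level, as here) yields the O(n^2.81) complexity; the paper's framework generalizes this scheme to other base-case splittings whose best choice depends on matrix size and shape.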