2012 IEEE 26th International Parallel and Distributed Processing Symposium
DOI: 10.1109/ipdps.2012.11
A Predictive Model for Solving Small Linear Algebra Problems in GPU Registers

Cited by 40 publications (22 citation statements). References 11 publications.
“…In Table 2, we empirically benchmark the bandwidth of the global memory and shared memory, again using benchmarks described in [10]. Our global memory bandwidth results are for memory accesses with unit stride: adjacent threads access adjacent global memory addresses.…”
Section: Benchmarking the Memory Hierarchy
confidence: 99%
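To illustrate the kind of measurement this quote refers to, here is a minimal CUDA sketch of a unit-stride global-memory bandwidth test. It is not the benchmark suite of [10]; the kernel name, array size, and timing setup are illustrative assumptions.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread copies one element; thread i touches element i, so
// adjacent threads access adjacent global addresses (unit stride).
__global__ void copy_unit_stride(const float* __restrict__ in,
                                 float* __restrict__ out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const size_t n = 1 << 26;                 // assumed working-set size
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    dim3 block(256), grid((n + 255) / 256);
    cudaEventRecord(start);
    copy_unit_stride<<<grid, block>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One read plus one write per element.
    double gb = 2.0 * n * sizeof(float) / 1e9;
    printf("Effective bandwidth: %.1f GB/s\n", gb / (ms / 1e3));

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

A strided variant (thread i touching element i*stride) would show the bandwidth penalty of uncoalesced access that the unit-stride figures avoid.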
“…Moreover, there are good reasons to believe that neither improved compiler technology nor autotuning will make any significant headway on this problem. This lack of coverage by current library infrastructure is especially alarming because of the number of applications from important fields that fit this profile, including deep learning [8], data mining [31], astrophysics [23], image and signal processing [4], [24], hydrodynamics [10], quantum chemistry [5], and computational fluid dynamics (CFD) and the resulting partial differential equations (PDEs) through direct and multifrontal solvers [42], to name a few. Dramatically better performance on these applications can be achieved by using software that can repetitively execute small matrix/tensor operations grouped together in "batches."…”
Section: Introduction
confidence: 99%
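The "batches" the quote describes map directly onto batched BLAS interfaces. The following hedged sketch groups many small dgemm operations into one cublasDgemmBatched call; the matrix size and batch count are assumptions for illustration, and error checking is elided.

#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 32, batch = 10000;          // assumed problem sizes
    const double alpha = 1.0, beta = 0.0;

    // One contiguous slab per operand, sliced into `batch` matrices.
    double *A, *B, *C;
    cudaMalloc(&A, (size_t)batch * m * m * sizeof(double));
    cudaMalloc(&B, (size_t)batch * m * m * sizeof(double));
    cudaMalloc(&C, (size_t)batch * m * m * sizeof(double));

    std::vector<const double*> hA(batch), hB(batch);
    std::vector<double*> hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = A + (size_t)i * m * m;
        hB[i] = B + (size_t)i * m * m;
        hC[i] = C + (size_t)i * m * m;
    }
    const double **dA; const double **dB; double **dC;
    cudaMalloc(&dA, batch * sizeof(double*));
    cudaMalloc(&dB, batch * sizeof(double*));
    cudaMalloc(&dC, batch * sizeof(double*));
    cudaMemcpy(dA, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    // One launch performs all `batch` small GEMMs: C_i = A_i * B_i.
    cublasDgemmBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, m, m, m,
                       &alpha, dA, m, dB, m, &beta, dC, m, batch);
    cublasDestroy(h);
    return 0;
}

Launching one 32x32 dgemm per kernel would leave the GPU almost idle; amortizing launch overhead across the whole batch is the point of the interface.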
“…Also, in combustion and astrophysics supernova applications [6], [7], [17], [23], [32], the study of thermonuclear reaction networks (the XNet package) requires the solution of many sparse linear systems of around 150 × 150. Furthermore, the need for batched routines can be illustrated in radar signal processing [4], where a batch of 200 × 200 QR decompositions is needed, as well as in hydrodynamic simulations [10], where thousands of matrix-matrix and matrix-vector (GEMV) products of matrices of around 100 × 100 are needed.…”
Section: Introduction
confidence: 99%
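For the many ~150 × 150 systems the quote mentions, one existing batched building block is the cuBLAS batched LU factorization. A hedged sketch follows; note the XNet systems are sparse, while cublasDgetrfBatched is dense, so this stands in only as an illustration of the batched-solver pattern, with assumed sizes.

#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 150, batch = 4096;          // assumed sizes
    double *slab;
    cudaMalloc(&slab, (size_t)batch * n * n * sizeof(double));

    // Device array of pointers, one per matrix in the batch.
    std::vector<double*> hA(batch);
    for (int i = 0; i < batch; ++i) hA[i] = slab + (size_t)i * n * n;
    double **dA;
    cudaMalloc(&dA, batch * sizeof(double*));
    cudaMemcpy(dA, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    int *d_pivots, *d_infos;                  // per-matrix pivots and status
    cudaMalloc(&d_pivots, (size_t)batch * n * sizeof(int));
    cudaMalloc(&d_infos, batch * sizeof(int));

    cublasHandle_t h;
    cublasCreate(&h);
    // One call LU-factorizes all `batch` matrices in place.
    cublasDgetrfBatched(h, n, dA, n, d_pivots, d_infos, batch);
    // cublasDgetrsBatched(...) would then solve against the factors.
    cublasDestroy(h);
    return 0;
}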
“…In magnetic resonance imaging (MRI), billions of small 8x8 and 32x32 eigenvalue problems need to be solved. A batched 200x200 QR decomposition is also required in radar signal processing [3]. Hydrodynamic simulations need to compute thousands of matrix-matrix (dgemm) or matrix-vector (dgemv) products of matrices of well over 100x100 [6].…”
Section: Introduction
confidence: 99%
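The batched 200x200 QR decomposition this quote cites also has a direct cuBLAS counterpart. Below is a minimal hedged sketch using cublasDgeqrfBatched; the batch count and slab layout are assumptions, not taken from the cited radar application.

#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 200, batch = 1000;          // assumed batch count
    double *Aslab, *Tslab;
    cudaMalloc(&Aslab, (size_t)batch * m * m * sizeof(double));
    cudaMalloc(&Tslab, (size_t)batch * m * sizeof(double));

    std::vector<double*> hA(batch), hT(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = Aslab + (size_t)i * m * m;
        hT[i] = Tslab + (size_t)i * m;        // Householder scalars per matrix
    }
    double **dA, **dT;
    cudaMalloc(&dA, batch * sizeof(double*));
    cudaMalloc(&dT, batch * sizeof(double*));
    cudaMemcpy(dA, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dT, hT.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    int info = 0;                             // host-side status flag
    // Each A_i is overwritten with its QR factorization in Householder form.
    cublasDgeqrfBatched(h, m, m, dA, m, dT, &info, batch);
    cublasDestroy(h);
    return 0;
}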