2018 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
DOI: 10.1109/pdp2018.2018.00065

Variable Batched DGEMM

Abstract: Many scientific applications need to solve a large number of small, independent problems. However, these individual problems do not expose enough parallelism to exploit current parallel architectures efficiently, so they must be computed as a batch in order to saturate the hardware. Today, vendors such as Intel and NVIDIA are developing their own suites of batched routines. Although most of the existing work focuses on batches of fixed size, in real applications we cannot assume a uniform size for a…
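As an illustration of the kind of operation the paper targets, the following is a minimal sketch of a variable batched DGEMM: each problem i in the batch has its own sizes m[i], n[i], k[i] and its own operand pointers, and the batch is processed with one plain cblas_dgemm call per problem inside an OpenMP parallel loop. The function name, argument layout, and scheduling choice are assumptions for illustration, not the routine proposed in the paper.

```c
#include <cblas.h>   /* compile with an OpenMP flag, e.g. -fopenmp */

/* Sketch of a variable batched DGEMM: every problem in the batch has its
 * own dimensions and operand pointers, so a fixed-size batch interface
 * cannot be used.  Dynamic scheduling helps when the sizes differ widely. */
void vbatched_dgemm(int batch,
                    const int *m, const int *n, const int *k,
                    const double **A, const int *lda,
                    const double **B, const int *ldb,
                    double **C, const int *ldc,
                    double alpha, double beta)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < batch; ++i) {
        /* C[i] = alpha * A[i] * B[i] + beta * C[i], column-major storage */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m[i], n[i], k[i],
                    alpha, A[i], lda[i],
                           B[i], ldb[i],
                    beta,  C[i], ldc[i]);
    }
}
```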

Cited by 18 publications (6 citation statements)
References 10 publications (15 reference statements)
“…There are numerous research efforts on improving the computational efficiency of MM (e.g. [17, 18-22]). However, none of them can be readily adapted to optimize the computation efficiency of MM in the context of HE computation.…”
Section: Related Work
confidence: 99%
“…Compared to the works that propose different strategies for optimizing the GEMM operation [4, 8, 11, 13-17, 19, 23, 26, 30], our work is the first to present highly optimized task-based implementations of the GEMM routine. On top of that, although distinct works have evaluated the execution of task-based versions of the GEMM routine [22,24,28,31,32], none of them (i) implement a highly optimized task-based version; and (ii) propose a heuristic to select the best parallelization scheme and parameters; as we do in this work.…”
Section: Related Work
confidence: 99%
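The statement above refers to task-based implementations of GEMM. As a rough sketch of what such an implementation can look like, the following splits C = alpha*A*B + beta*C into column-block OpenMP tasks, each computed with a regular cblas_dgemm call; the blocking scheme and the block size NB are assumptions for illustration only, not the parallelization scheme or heuristic of the cited work.

```c
#include <stddef.h>
#include <cblas.h>   /* compile with an OpenMP flag, e.g. -fopenmp */

#define NB 256  /* illustrative block size, not a tuned value */

/* Task-based GEMM sketch: one task per disjoint block of columns of C
 * (and of B), so tasks never write to the same elements of C. */
void task_dgemm(int m, int n, int k, double alpha,
                const double *A, int lda,
                const double *B, int ldb,
                double beta, double *C, int ldc)
{
    #pragma omp parallel
    #pragma omp single
    for (int j = 0; j < n; j += NB) {
        int nb = (n - j < NB) ? (n - j) : NB;
        #pragma omp task firstprivate(j, nb)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, nb, k,
                    alpha, A, lda,
                           &B[(size_t)j * ldb], ldb,
                    beta,  &C[(size_t)j * ldc], ldc);
    }
}
```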
“…Finally, aiming to keep using coarse tasks but trying to adapt to the different amount of non-zero elements per row, we propose to apply the grouping approach of Valero-Lara et al [16,22]. In this case, we create groups of rows according to a limit (given by the architecture, e.g.…”
Section: Grouping
confidence: 99%
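The grouping idea quoted above packs consecutive rows into one coarse task until an architecture-dependent limit is reached. Below is a minimal sketch under that assumption, using CSR storage and a nonzero-count limit; names such as row_ptr, limit, and group_start are illustrative, not taken from Valero-Lara et al.

```c
#include <stddef.h>

/* Pack consecutive CSR rows into groups (coarse tasks), growing each group
 * until adding the next row would exceed the nonzero limit.  Every group
 * contains at least one row.  Returns the number of groups; group g spans
 * rows [group_start[g], group_start[g+1]). */
size_t build_row_groups(const int *row_ptr, int n_rows, int limit,
                        int *group_start /* capacity n_rows + 1 */)
{
    size_t n_groups = 0;
    int row = 0;
    while (row < n_rows) {
        group_start[n_groups++] = row;
        int nnz_in_group = 0;
        do {
            nnz_in_group += row_ptr[row + 1] - row_ptr[row];
            row++;
        } while (row < n_rows &&
                 nnz_in_group + (row_ptr[row + 1] - row_ptr[row]) <= limit);
    }
    group_start[n_groups] = n_rows;   /* sentinel marking the end */
    return n_groups;
}
```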