2018 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
DOI: 10.1109/pdp2018.2018.00065

Variable Batched DGEMM

Abstract: Many scientific applications need to solve a large number of small, independent problems. However, these individual problems do not expose enough parallelism to exploit current parallel architectures efficiently, so they must be computed as a batch in order to saturate the hardware. Today, vendors such as Intel and NVIDIA are developing their own suites of batched routines. Although most of the existing work focuses on batches of fixed size, in real applications we cannot assume a uniform size for a…
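As an illustration of the kind of operation the paper targets, the following is a minimal sketch of a variable batched DGEMM: each problem i in the batch has its own sizes m[i], n[i], k[i] and its own operand pointers, and the batch is processed with one plain cblas_dgemm call per problem inside an OpenMP parallel loop. The function name, argument layout, and scheduling choice are assumptions for illustration, not the routine proposed in the paper.

```c
#include <cblas.h>   /* compile with an OpenMP flag, e.g. -fopenmp */

/* Sketch of a variable batched DGEMM: every problem in the batch has its
 * own dimensions and operand pointers, so a fixed-size batch interface
 * cannot be used.  Dynamic scheduling helps when the sizes differ widely. */
void vbatched_dgemm(int batch,
                    const int *m, const int *n, const int *k,
                    const double **A, const int *lda,
                    const double **B, const int *ldb,
                    double **C, const int *ldc,
                    double alpha, double beta)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < batch; ++i) {
        /* C[i] = alpha * A[i] * B[i] + beta * C[i], column-major storage */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m[i], n[i], k[i],
                    alpha, A[i], lda[i],
                           B[i], ldb[i],
                    beta,  C[i], ldc[i]);
    }
}
```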

Cited by 18 publications (6 citation statements)
References 10 publications (15 reference statements)
“…There are numerous research efforts on improving the computational efficiency of MM (e.g. [17, 18-22]). However, none of them can be readily adapted to optimize the computation efficiency of MM in the context of HE computation.…”
Section: Related Work
confidence: 99%
“…Compared to the works that propose different strategies for optimizing the GEMM operation [4, 8, 11, 13-17, 19, 23, 26, 30], our work is the first to present highly optimized task-based implementations of the GEMM routine. On top of that, although distinct works have evaluated the execution of task-based versions of the GEMM routine [22,24,28,31,32], none of them (i) implement a highly optimized task-based version; and (ii) propose a heuristic to select the best parallelization scheme and parameters; as we do in this work.…”
Section: Related Work
confidence: 99%
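The statement above refers to task-based implementations of GEMM. As a rough sketch of what such an implementation can look like, the following splits C = alpha*A*B + beta*C into column-block OpenMP tasks, each computed with a regular cblas_dgemm call; the blocking scheme and the block size NB are assumptions for illustration only, not the parallelization scheme or heuristic of the cited work.

```c
#include <stddef.h>
#include <cblas.h>   /* compile with an OpenMP flag, e.g. -fopenmp */

#define NB 256  /* illustrative block size, not a tuned value */

/* Task-based GEMM sketch: one task per disjoint block of columns of C
 * (and of B), so tasks never write to the same elements of C. */
void task_dgemm(int m, int n, int k, double alpha,
                const double *A, int lda,
                const double *B, int ldb,
                double beta, double *C, int ldc)
{
    #pragma omp parallel
    #pragma omp single
    for (int j = 0; j < n; j += NB) {
        int nb = (n - j < NB) ? (n - j) : NB;
        #pragma omp task firstprivate(j, nb)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, nb, k,
                    alpha, A, lda,
                           &B[(size_t)j * ldb], ldb,
                    beta,  &C[(size_t)j * ldc], ldc);
    }
}
```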
“…Finally, aiming to keep using coarse tasks but trying to adapt to the different amount of non-zero elements per row, we propose to apply the grouping approach of Valero-Lara et al [16,22]. In this case, we create groups of rows according to a limit (given by the architecture, e.g.…”
Section: Grouping
confidence: 99%
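The grouping idea quoted above packs consecutive rows into one coarse task until an architecture-dependent limit is reached. Below is a minimal sketch under that assumption, using CSR storage and a nonzero-count limit; names such as row_ptr, limit, and group_start are illustrative, not taken from Valero-Lara et al.

```c
#include <stddef.h>

/* Pack consecutive CSR rows into groups (coarse tasks), growing each group
 * until adding the next row would exceed the nonzero limit.  Every group
 * contains at least one row.  Returns the number of groups; group g spans
 * rows [group_start[g], group_start[g+1]). */
size_t build_row_groups(const int *row_ptr, int n_rows, int limit,
                        int *group_start /* capacity n_rows + 1 */)
{
    size_t n_groups = 0;
    int row = 0;
    while (row < n_rows) {
        group_start[n_groups++] = row;
        int nnz_in_group = 0;
        do {
            nnz_in_group += row_ptr[row + 1] - row_ptr[row];
            row++;
        } while (row < n_rows &&
                 nnz_in_group + (row_ptr[row + 1] - row_ptr[row]) <= limit);
    }
    group_start[n_groups] = n_rows;   /* sentinel marking the end */
    return n_groups;
}
```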