OpenMP-based parallel implementation of matrix-matrix multiplication on the Intel Knights Landing

Lim, Roktaek; Lee, Yeongha; Kim, Raehyun; Choi, Jaeyoung

doi:10.1145/3176364.3176374

Cited by 9 publications

(4 citation statements)

References 9 publications


“…CPU/GPU heterogeneous parallel programming model is based on a heterogeneous computing platform where computing power involving both GPUs and CPUs is considered [52]. OpenMP supports multi-threaded concurrent execution of tasks on multi-core CPUs [53]. The independence of CPU cores allows different tasks to be performed simultaneously among different OpenMP threads.…”

Section: CPU/GPU Heterogeneous Computing, 2.3.1 GPU Parallel Architecture (mentioning)

confidence: 99%

A Hybrid Parallel Strategy for Isogeometric Topology Optimization via CPU/GPU Heterogeneous Computing

Xia, Gao, et al. 2024, CMES


This paper aims to solve large-scale and complex isogeometric topology optimization problems that consume significant computational resources. A novel isogeometric topology optimization method with a hybrid CPU/GPU parallel strategy is proposed, and the hybrid parallel strategies for stiffness matrix assembly, equation solving, sensitivity analysis, and design variable update are discussed in detail. To ensure the high efficiency of CPU/GPU computing, a workload balancing strategy is presented for optimally distributing the workload between CPU and GPU. To illustrate the advantages of the proposed method, three benchmark examples are tested to verify the hybrid parallel strategy. The results show that the hybrid method is more efficient than both serial CPU and parallel GPU implementations, with speedups of up to two orders of magnitude.
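The citation statement above notes that independent CPU cores let different OpenMP threads perform different tasks at the same time. A minimal sketch of that idea in C follows. The function `sum_and_max` and its arguments are hypothetical names chosen for illustration and do not come from the cited paper. The sketch uses the standard OpenMP `sections` construct, so each section may run on a different thread. If OpenMP is not enabled at compile time, the pragmas are ignored and the two blocks simply run one after the other.

```c
#include <assert.h>

/* Hypothetical example: compute the sum and the maximum of an array
   as two independent tasks that OpenMP may run on different threads. */
void sum_and_max(const double *x, int n, double *sum_out, double *max_out)
{
    assert(n > 0);
    double sum = 0.0;
    double max = x[0];

    /* Each section writes only its own variable, so there is no data race. */
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (int i = 0; i < n; ++i)
                sum += x[i];
        }
        #pragma omp section
        {
            for (int i = 1; i < n; ++i)
                if (x[i] > max)
                    max = x[i];
        }
    }

    *sum_out = sum;
    *max_out = max;
}
```

With GCC or Clang, the pragmas take effect when the file is compiled with `-fopenmp`.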



“…Jiang et al [16] propose a three-level blocking DGEMM algorithm to improve data-locality in the Sunway TaihuLight supercomputer. Lim et al [19] optimize a DGEMM OpenMP fork-join version by choosing the proper block size and thread affinity to the Intel Xeon Phi. Abdelfattah et al [4] propose HGEMM to improve the performance in GPU Tensor Cores.…”

Section: Related Work (mentioning)

confidence: 99%

Seamless optimization of the GEMM kernel for task-based programming models

Lorenzon, Marques, Navarro, et al. 2022, Proceedings of the 36th ACM International Conference on Supercomputing


The general matrix-matrix multiplication (GEMM) kernel is a fundamental building block of many scientific applications. Many libraries such as Intel MKL and BLIS provide highly optimized sequential and parallel versions of this kernel. The parallel implementations of the GEMM kernel rely on the well-known fork-join execution model to exploit multi-core systems efficiently. However, these implementations are not well suited for task-based applications as they break the data-flow execution model. In this paper, we present a task-based implementation of the GEMM kernel that can be seamlessly leveraged by task-based applications while providing better performance than the fork-join version. Our implementation leverages several advanced features of the OmpSs-2 programming model and a new heuristic to select the best parallelization strategy and blocking parameters based on the matrix and hardware characteristics. When evaluating the performance and energy consumption on two modern multi-core systems, we show that our implementations provide significant performance improvements over an optimized OpenMP fork-join implementation, and can beat vendor implementations of the GEMM (e.g., Intel MKL and AMD AOCL). We also demonstrate that a real application can leverage our optimized task-based implementation to enhance performance.

CCS Concepts: Computing methodologies → Massively parallel algorithms; Theory of computation → Parallel computing models.
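The citation statement above describes a fork-join OpenMP DGEMM tuned through block size, and the abstract contrasts that model with task-based execution. The following C code is a minimal sketch of a blocked, fork-join matrix multiplication. It is not the implementation from either paper. The name `dgemm_blocked`, the row-major square layout, and the block-size parameter `bs` are assumptions made for illustration. Each thread owns distinct tiles of `C`, so no two threads write the same element.

```c
#include <assert.h>

/* Hypothetical helper: smaller of two ints, used to clip edge tiles. */
static int min_int(int a, int b) { return a < b ? a : b; }

/* Sketch: C += A * B for row-major n x n matrices, tiled into bs x bs blocks.
   The caller is assumed to have initialized C (for example, to zero). */
void dgemm_blocked(int n, int bs, const double *A, const double *B, double *C)
{
    assert(n >= 0 && bs > 0);

    /* Fork: the (ii, jj) tile loops are shared among threads.
       Join: implicit barrier at the end of the parallel loop. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < n; ii += bs) {
        for (int jj = 0; jj < n; jj += bs) {
            for (int kk = 0; kk < n; kk += bs) {
                int i_end = min_int(ii + bs, n);
                int j_end = min_int(jj + bs, n);
                int k_end = min_int(kk + bs, n);
                for (int i = ii; i < i_end; ++i) {
                    for (int k = kk; k < k_end; ++k) {
                        double a = A[i * n + k];
                        /* Unit-stride inner loop over a row of the B tile. */
                        for (int j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

A suitable `bs` depends on the cache sizes of the target machine and would normally be found by measurement. Thread placement is set outside the code, for example through the standard `OMP_PROC_BIND` and `OMP_PLACES` environment variables.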


“…For nodes that support hyperthreading, the granularity modifier specifies whether to pin OpenMP threads to physical cores (granularity=core) or logical cores (granularity=fine). Using granularity=thread enables distribution of OpenMP threads in a compact and/or scatter fashion [26]. For this work KMP_AFFINITY=granularity=fine was used as it prevented Matlab/Octave from over-allocating OpenMP threads to the same processor core as determined by monitoring the compute node with the Linux htop command during execution.…”

Section: OpenMP (mentioning)

confidence: 99%

Optimizing Xeon Phi for Interactive Data Analysis

Byun, Klein, Milechin, et al. 2019, 2019 IEEE High Performance Extreme Computing Conference (HPEC)


The Intel Xeon Phi manycore processor is designed to provide high performance matrix computations of the type often performed in data analysis. Common data analysis environments include Matlab, GNU Octave, Julia, Python, and R. Achieving optimal performance of matrix operations within data analysis environments requires tuning the Xeon Phi OpenMP settings, process pinning, and memory modes. This paper describes matrix multiplication performance results for Matlab and GNU Octave over a variety of combinations of process counts and OpenMP threads and Xeon Phi memory modes. These results indicate that using KMP_AFFINITY=granularity=fine, taskset pinning, and all2all cache memory mode allows both Matlab and GNU Octave to achieve 66% of the practical peak performance for process counts ranging from 1 to 64 and OpenMP threads ranging from 1 to 64. These settings have resulted in generally improved performance across a range of applications and have enabled our Xeon Phi system to deliver significant results in a number of real-world applications.
