Practical Implementation of Lattice QCD Simulation on SIMD Machines with Intel AVX-512

Kanamori, Issaku; Matsufuru, Hideo

doi:10.1007/978-3-319-95168-3_31

Cited by 12 publications

(13 citation statements)

References 12 publications


“…] which is written in C++ based on the object-oriented design. Bridge++ has been used to investigate a recipe of tuning on Intel AVX-512 architectures [21,22].…”

Section: Multi-grid Algorithm (mentioning)

confidence: 99%

“…For the smaller local volume case, the domain-decomposed operator is faster than the full operator as expected from the absence of communications. The tuning described in [22] for full operator works quite efficiently for a larger local volume so that its performance exceeds…”

Section: Performance On Intel Xeon Phi Cluster (mentioning)

confidence: 99%

“…The implementation of the fine grid operators inherits the previous works [21,22] and uses the L2 prefetch in the full operator D. Although it has been tuned for D, the same prefetching pattern is used in the SAP operator.…”

Section: Implementation For Intel AVX-512 Architecture (mentioning)

confidence: 99%


Object-Oriented Implementation of Algebraic Multi-grid Solver for Lattice QCD on SIMD Architectures and GPU Clusters

Kanamori, Ishikawa, Matsufuru

2021

Computational Science and Its Applications – ICCSA 2021

Self Cite


A portable implementation of elaborated algorithm is important to use variety of architectures in HPC applications. In this work we implement and benchmark an algebraic multi-grid solver for Lattice QCD on three different architectures, Intel Xeon Phi, Fujitsu A64FX, and NVIDIA Tesla V100, in keeping high performance and portability of the code based on the object-oriented paradigm. Some parts of code are specific to an architecture employing appropriate data layout and tuned matrix-vector multiplication kernels, while the implementation of abstract solver algorithm is common to all architectures. Although the performance of the solver depends on tuning of the architecture-dependent part, we observe reasonable scaling behavior and better performance than the mixed precision BiCGStab solvers.


“…] which is written in C++ based on the object-oriented design. Bridge++ has been used to investigate a recipe of tuning on Intel AVX-512 architectures [21,22].…”

Section: Multi-grid Algorithm (mentioning)

confidence: 99%

Section: Performance On Intel Xeon Phi Cluster (mentioning)

confidence: 99%


Object-Oriented Implementation of Algebraic Multi-grid Solver for Lattice QCD on SIMD Architectures and GPU Clusters

Kanamori, Ishikawa, Matsufuru

2021

Computational Science and Its Applications – ICCSA 2021

Self Cite


“…This requires that the lattice size in x-direction must be a multiple of 8. The details of the tuning with the AVX-512 instruction set were presented in [7]. For Armv8.2-A-SVE, we adopt a different packing: as depicted in the right panel of Fig.…”

Section: SIMD Architectures: Intel AVX-512 and Fujitsu A64FX (mentioning)

confidence: 99%

“…Recent supercomputers, however, adopt a variety of architecture: multi-core parallel machines with wide SIMD (A64FX and Intel processors), and clusters with accelerator devices such as GPUs, PEZY-SC, and vector processors (NEC SX-Aurora). Soon after the first public release of Bridge++ in 2012 [2], we had started to investigate possible extensions of our code to exploit these new architectures while keeping the readability and portability, as well as to develop tuning techniques for them [3,4,5,6,7,8]. Recently we have constructed a framework to incorporate the tuned codes as an alternative part to the previously developed Bridge++ code, and decided to release it as version 2.0.…”

Section: Introduction (mentioning)

confidence: 99%

General purpose lattice QCD code set Bridge++ 2.0 for high performance computing

Akahoshi, Aoki, Aoyama et al.

2021

Preprint

Self Cite


Bridge++ is a general-purpose code set for a numerical simulation of lattice QCD aiming at a readable, extensible, and portable code while keeping practically high performance. The previous version of Bridge++ is implemented in double precision with a fixed data layout. To exploit the high arithmetic capability of new processor architecture, we extend the Bridge++ code so that optimized code is available as a new branch, i.e., an alternative to the original code. This paper explains our strategy of implementation and displays application examples to the following architectures and systems: Intel AVX-512 on Xeon Phi Knights Landing, Arm A64FX-SVE on Fujitsu A64FX (Fugaku), NEC SX-Aurora TSUBASA, and GPU cluster with NVIDIA V100.

Footnote 1: https://bridge.kek.jp/Lattice-code/

Footnote 2: Basics of lattice QCD are covered by many textbooks, e.g., [1].


Large-Scale Parallelization of Lattice QCD on Sunway TaihuLight Supercomputer

Luan, Gong et al.

2021

Advances in Parallel & Distributed Processing, and Applications

