Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing

Kanamori, Issaku; Matsufuru, Hideo

doi:10.1109/candar.2017.66

Cited by 5 publications

(13 citation statements)

References 12 publications

“…An approach keeping the array of structure data layout and inserting pragmas [9] gives 225 GFlops (245 GFlops after correcting the difference in clock cycle). In our previous report [3], which corresponds to the layout 2 without redundant boundary data packing/copy, the best performance on single node was 340 GFlops (4 MPI proc./node). With the same condition, it becomes 369 GFlops whose improvement is mainly due to the refinement of the prefetch.…”

Section: Data Layout (mentioning)

confidence: 96%

Practical Implementation of Lattice QCD Simulation on SIMD Machines with Intel AVX-512

Kanamori

Matsufuru

2018

Computational Science and Its Applications – ICCSA 2018

Self Cite

We investigate implementation of lattice Quantum Chromodynamics (QCD) code on the Intel AVX-512 architecture. The most time consuming part of the numerical simulations of lattice QCD is a solver of linear equation for a large sparse matrix that represents the strong interaction among quarks. To establish widely applicable prescriptions, we examine rather general methods for the SIMD architecture of AVX-512, such as using intrinsics and manual prefetching, for the matrix multiplication. Based on experience on the Oakforest-PACS system, a large scale cluster composed of Intel Xeon Phi Knights Landing, we discuss the performance tuning exploiting AVX-512 and code design on the SIMD architecture and massively parallel machines. We observe that the same code runs efficiently on an Intel Xeon Skylake-SP machine.

“…As a testbed of our analysis, we choose two types of fermion matrices together with an iterative linear equation solver. In our previous report [2,3], we developed a code along the above policy and applied it to KNL. In this paper, in addition to improved performance, we rearrange these prescriptions so that each effect is more apparent.…”

Section: Introduction (mentioning)

confidence: 99%

“…Data layouts: SoA (20, 22, 23, 28, 30, 36, 50, 79, 82, 90), AoS (9), AoSoA (57, 90); Data alignment (6, 9, 14, 18, 20, 24, 44, 45, 52, 53, 66, 79, 84, 90); Padding (4, 7, 9, 20, 24, 44, 52, 53, 79, 82, 91); Dependency disambiguation (15, 28, 36, 82, 91); Prefetching: software, 4, 7, 9, 14, 17, 22, 23, 40, 41, 50,…”

Section: Table 3 Optimization Strategies (mentioning)

confidence: 99%

“…Kanamori et al 66 accelerate "lattice quantum chromodynamics" (QCD) code on KNL. For the complex vector data, the real and imaginary parts are placed consecutively in the memory.…”

Section: Prefetching (mentioning)

confidence: 99%
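The layout described in the statement above can be illustrated with a minimal sketch. It is not the authors' code, which targets AVX-512; the function names (`pack_interleaved`, `pack_split`, `load_interleaved`) are hypothetical and the three-element vector is made up for illustration. It only contrasts storing each complex number as an adjacent (real, imaginary) pair with storing all real parts followed by all imaginary parts.

```python
def pack_interleaved(values):
    """Flatten complex numbers as [re0, im0, re1, im1, ...].

    The real and imaginary parts of each element sit next to each
    other in memory, as in the quoted description.
    """
    flat = []
    for z in values:
        flat.append(z.real)
        flat.append(z.imag)
    return flat


def pack_split(values):
    """Alternative for comparison: all real parts, then all imaginary parts."""
    return [z.real for z in values] + [z.imag for z in values]


def load_interleaved(flat, i):
    """Read element i back from the interleaved buffer."""
    return complex(flat[2 * i], flat[2 * i + 1])


# Hypothetical example data, not taken from the paper.
values = [1 + 2j, 3 + 4j, 5 + 6j]
interleaved = pack_interleaved(values)
split = pack_split(values)
```

With the interleaved buffer, element `i` occupies positions `2*i` and `2*i + 1`, so one contiguous load fetches both parts of a complex number.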

“…Vectorization-related: difficulties in or imperfect vectorization (2, 22, 25), incompatibility of SSE-style vectorization (2); Poor scalability with increasing threads/nodes (2, 16, 20, 29, 37, 57, 62, 66, 84); Slow operations: reduction operation (86), gather-scatter operations (82), divide operations (13, 26); Data dependency (99); Load-imbalance: between threads of Phi (13, 17, 23, 50), CPU and Phi (18, 43, 50, 76); Contention on shared components: ring stop (29, 37, 84), disk-access (37); Others: lack of support for vector atomic instructions (37), thread-creation overhead (30), TLB misses (42, 90) … intrinsic functions for introducing vector variables and processes. This approach provides better performance compared to the compiler performing auto-vectorization.…”

Section: Comparison With CPU (mentioning)

confidence: 99%

A survey on evaluating and optimizing performance of Intel Xeon Phi

Mittal

2020

Concurrency and Computation

Summary: Intel's Xeon Phi combines the parallel processing power of a many-core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors that bottleneck the performance of Phi. We also review works that perform comparison or collaborative execution of Phi with CPUs and GPUs. This paper will be useful for researchers and developers in the area of computer-architecture and high-performance computing.

Object-Oriented Implementation of Algebraic Multi-grid Solver for Lattice QCD on SIMD Architectures and GPU Clusters

Kanamori

Ishikawa

Matsufuru

2021

Computational Science and Its Applications – ICCSA 2021

Self Cite

A portable implementation of elaborated algorithm is important to use variety of architectures in HPC applications. In this work we implement and benchmark an algebraic multi-grid solver for Lattice QCD on three different architectures, Intel Xeon Phi, Fujitsu A64FX, and NVIDIA Tesla V100, in keeping high performance and portability of the code based on the object-oriented paradigm. Some parts of code are specific to an architecture employing appropriate data layout and tuned matrix-vector multiplication kernels, while the implementation of abstract solver algorithm is common to all architectures. Although the performance of the solver depends on tuning of the architecture-dependent part, we observe reasonable scaling behavior and better performance than the mixed precision BiCGStab solvers.
