MILC Code Performance on High End CPU and GPU Supercomputer Clusters

DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, D.

doi:10.1051/epjconf/201817502009

Cited by 6 publications

(5 citation statements)

References 5 publications


“…[32] and [33]. The performance of the MILC code, on various architectures, is enhanced by using QOP [34], QPhiX [35][36][37][38], or QUDA [39][40][41][42].…”

Section: MILC Collaboration (mentioning)

confidence: 99%

Lattice gauge ensembles and data management

Bali, Bignell, Francis et al. 2023

Proceedings of the 39th International Symposium on Lattice Field Theory — PoS(LATTICE2022)


… data consumer follows good scientific practice and properly acknowledges the source of the data and gives credit to the data providers.

• In a community-wide context, the data providers can make their valuable data available on some storage infrastructure at no extra cost in terms of human or hardware resources. Declaring data "public" will make these known to other researchers who will frequently use these in other projects, so that the data providers receive recognition and citations.

In the real world, where many of those responsible for generating, storing and managing the data are on temporary positions and where large, globally accessible, long-term storage is not for free, the situation is more challenging.

In this contribution, we collect the present status of ensemble generation to inform both data consumers and providers about the availability of gauge ensembles and present practices. We restrict ourselves to simulations of QCD. At present, these are mostly carried out using N_f = 2+1, N_f = 2+1+1 and also N_f = 1+1+1+1 sea quark flavours q = u, d, s, c, with various fermion discretizations. Naturally, we can only cover simulations by the groups who responded to the call. The next section provides the current status. This is followed by a brief summary.



“…Overall, this takes 2N_c · 4 · 80 flops. 30 For 36 of the 40 stored directions this matrix is not unitary; compare the caption of Tab. 13.…”

Section: E3 Brillouin Laplace Operator (mentioning)

confidence: 99%

“…This brief exposition of the subject cannot do justice to the effort spent by other authors to maximize performance on a specific architecture for a given Dirac operator D. Recent review talks on the interplay between algorithms and machines in lattice QCD include [19][20][21][22][23]. In addition, there is a number of HPC projects in lattice QCD with similar objectives on several architectures [24][25][26][27][28][29][30][31][32][33][34][35]. Preliminary accounts 6 of this work were given in [36,37].…”

Section: Introduction (mentioning)

confidence: 99%

Fast and flexible implementations of Wilson, Brillouin and Susskind fermions in lattice QCD

Dürr 2021

Preprint


A modern Fortran implementation of three Dirac operators (Wilson, Brillouin, Susskind) in lattice QCD is presented, based on OpenMP shared-memory parallelization and SIMD pragmas. The main idea is to apply a Dirac operator to N_v vectors simultaneously, to ease the memory bandwidth bottleneck. All index computations are left to the compiler and maximum weight is given to portability and flexibility. The lattice volume, N_x N_y N_z N_t, the number of colors, N_c, and the number of right-hand sides, N_v, are parameters defined at compile time. Several memory layout options are compared. The code performs well on modern many-core architectures (480 Gflop/s, 880 Gflop/s, and 780 Gflop/s with N_v = 12 for the three operators in single precision on a 72-core KNL processor, a 2×24-core Skylake node yields similar results). Explicit run-time tests with CG/BiCGstab inverters confirm that the memory layout is relevant for the KNL, but less so for the Skylake architecture. The ancillary code distribution contains all routines, including the single, double, and mixed precision Krylov space solvers, to render it self-contained and ready-to-use.

Footnote 1: Here "frequently" means O(10^5) times, "large" implies an n × n matrix with n = 402 653 184 for a Wilson fermion on a 64^3 × 128 lattice, and depending on the quark mass the condition number of D†D is often in the range 10^6 ... 10^8. The factor 10^5 reflects the production of an ensemble of 1000 gauge configurations, separated by ten τ = 1 HMC trajectories, assuming that each of these requires O(10) inversions.
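The two numbers quoted in the footnote can be reproduced with a short calculation. The sketch below is illustrative only and is not part of the cited code distribution. It assumes that each lattice site of a Wilson fermion carries 4 spinor components times N_c = 3 colors, and it takes the "O(10) inversions" per trajectory as exactly 10 to obtain an order-of-magnitude figure. The function names are hypothetical.

```python
def wilson_matrix_dim(nx, ny, nz, nt, n_spin=4, n_color=3):
    """Dimension n of the n x n Wilson-Dirac matrix.

    Assumption: n_spin * n_color degrees of freedom per lattice site.
    """
    return nx * ny * nz * nt * n_spin * n_color


def total_inversions(configs, trajectories_per_config, inversions_per_trajectory):
    """Order-of-magnitude count of solver calls for one ensemble."""
    return configs * trajectories_per_config * inversions_per_trajectory


# 64^3 x 128 lattice from the footnote
n = wilson_matrix_dim(64, 64, 64, 128)
print(n)  # 402653184

# 1000 configurations, 10 trajectories each, about 10 inversions per trajectory
print(total_inversions(1000, 10, 10))  # 100000
```

The first result matches n = 402 653 184 and the second matches the stated factor of 10^5.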


“…From a HPC viewpoint, a clear advantage of this operator with precomputed V µ is that its stencil is restricted to sites which are at most one hop away. Still, it is not trivial to reach an acceptable performance on a many-core architecture [4,5].…”

Section: Staggered Kernel Details and Performance (mentioning)

confidence: 99%

Three Dirac operators on two architectures with one piece of code and no hassle

Dürr 2019

Proceedings of the 36th Annual International Symposium on Lattice Field Theory — PoS(LATTICE2018)


A simple minded approach to implement three discretizations of the Dirac operator (staggered, Wilson, Brillouin) on two architectures (KNL and core i7) is presented. The idea is to use a high-level compiler along with OpenMP parallelization and SIMD pragmas, but to stay away from cache-line optimization and/or assembly-tuning. The implementation is for N_v right-hand sides, and this extra index is used to fill the SIMD pipeline. On one KNL node single precision performance figures for N_c = 3, N_v = 12 read 475 Gflop/s, 345 Gflop/s, and 790 Gflop/s for the three discretization schemes, respectively.
