Abstract-Quantum Monte Carlo (QMC) methods form the core kernels of many scientific computer simulations. Because of their random nature, QMC implementations on distributed NUMA clusters may suffer from load balancing issues at the petascale level. We study a simulation of inhomogeneous ultra-cold atoms on an optical lattice, for which we developed a QMC algorithm with a hybrid MPI+OpenMP programming model. This hybrid model uses nested parallelism: the outer loops are parallelized with MPI, while the inner loop relies on OpenMP. In this work, we present an adaptive computing approach that uses our Adaptive Computing Library to learn the system workload dynamically at run-time and then creates an appropriate number of OpenMP threads based on the availability of system resources during execution. Our implementation shows that the adaptive approach achieves very good load balancing without unnecessary overhead and improves performance by up to 20% compared with an MPI-only implementation on a Cray XE6m supercomputer.

Index Terms-Hybrid parallel programming, load balancing, QMC simulation.
I. INTRODUCTION

Many scientific simulations use the quantum Monte Carlo (QMC) method in their most time-consuming kernels. The QMC method provides an accurate description of many-body physics and can be applied to problems in chemistry, biology, physics, materials science, and even drug design at the molecular level. In our collaborative and interdisciplinary work with the physics department of Georgetown University, we aim to build a scalable and efficient simulator of ultra-cold atoms on an optical lattice using inhomogeneous dynamical mean-field theory (IDMFT), in which the most time-consuming portions of the simulation are again QMC methods [1], [2].

These calculations are computationally intensive and require very large high-performance computing systems in order to study realistic systems. The simulator was therefore originally written with a FORTRAN+MPI programming model to run on distributed computer clusters using a manager-worker paradigm. However, we observed a load balancing problem, because each MPI process performs QMC, which is random by nature. We implemented a hybrid MPI+OpenMP programming model to overcome the load balancing issues. However, this causes a memory stalling problem on the Cray XE6m, where each compute node has a non-uniform memory access (NUMA) memory design. When the system is already fully occupied by MPI processes [3], launching new OpenMP threads [4] stresses the memory bandwidth and starves some threads. To overcome this difficulty, we designed the Adaptive Computing Library (ACL), which can be used to improve the hybrid MPI+OpenMP or MPI+thread programming model.

The cluster system, which consists of shared-memory nodes with several multi-core CPUs connected by a high-speed network to form a distributed-memory system, is the most widely available hardware architecture in the high-performance computing community. On these systems, the hybrid para...