We report a new multi-GPU capable ab initio Hartree-Fock/density functional theory implementation integrated into the open source QUantum Interaction Computational Kernel (QUICK) program. Details on the load balancing algorithms for electron repulsion integrals and exchange correlation quadrature across multiple GPUs are described. Benchmarking studies carried out on up to 4 GPU nodes, each containing 4 NVIDIA V100-SXM2 type GPUs, demonstrate that our implementation is capable of achieving excellent load balancing and high parallel efficiency. For representative medium to large size protein/organic molecular systems, the observed efficiencies remained above 86%. The accelerations on NVIDIA A100, P100 and K80 platforms also realized parallel efficiencies higher than 74%, paving the way for large-scale ab initio electronic structure calculations.
where $D$ is the region of integration, $x_i$ are the evaluation points and $w_i$ the corresponding weights, $i = 1, \dots, n$. Various traditional deterministic methods have been proposed in the past and are still used to solve the problem in different dimensions (mostly low ones). Some of the methods traditionally used for 1-D adaptive integration are Simpson's 3/5-points, Newton-Cotes 8-point, Gauss-Kronrod 7/15-points and Gauss-Kronrod 10/21-points. For integrands in 2-D and 3-D, Newton-Cotes 8-point, Gauss-Kronrod 7/15-points and Gauss-Kronrod 10/21-points are often used. In higher dimensions, however, the execution time of these algorithms becomes unacceptable because the number of function evaluations grows exponentially with the dimension, necessitating the use of Monte Carlo techniques, which have accuracy issues. CUHRE [1], on the other hand, is a deterministic algorithm that uses one of several cubature rules of polynomial degree in a globally adaptive subdivision scheme. CUHRE is the best known open source solution for solving multidimensional integration in a reasonable amount of time. In moderate dimensions CUHRE is very competitive, particularly if the integrand is well approximated by polynomials. As the dimension increases, the number of points sampled by the cubature rules rises considerably, which drives the execution time of CUHRE to unacceptable levels and reduces its usefulness.
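The weighted-sum form of a quadrature rule, and the exponential growth in function evaluations noted above, can be illustrated with a minimal sketch. It assumes a tensor product of the 3-point Gauss-Legendre rule on the cube [-1, 1]^d. That rule is chosen only for brevity and is not one of the rules named in the excerpt, and the names `tensor_gauss` and `sum_of_squares` are hypothetical.

```python
import itertools
import math

# 3-point Gauss-Legendre rule on [-1, 1] (exact for polynomials up to degree 5).
# Assumption: chosen for illustration only; not a rule named in the excerpt.
NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)


def tensor_gauss(f, dim):
    """Approximate the integral of f over [-1, 1]^dim as sum_i w_i * f(x_i).

    Returns (estimate, number_of_function_evaluations).
    """
    total = 0.0
    evaluations = 0
    # Every combination of 1-D nodes is one evaluation point: 3**dim in total.
    for idx in itertools.product(range(len(NODES)), repeat=dim):
        x = [NODES[i] for i in idx]
        w = math.prod(WEIGHTS[i] for i in idx)  # product of the 1-D weights
        total += w * f(x)
        evaluations += 1
    return total, evaluations


def sum_of_squares(x):
    """Hypothetical test integrand: x_1^2 + ... + x_d^2."""
    return sum(xi * xi for xi in x)
```

Calling `tensor_gauss(sum_of_squares, d)` for increasing `d` shows the evaluation count growing as 3^d, which is the scaling that makes such product rules impractical in high dimensions.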
Abstract: We present a memory-efficient algorithm and its implementation for solving multidimensional numerical integration on a cluster of compute nodes with multiple GPU devices per node. The effective use of shared memory is important for improving performance on GPUs because of the bandwidth limitation of global memory. The best known sequential algorithm for multidimensional numerical integration, CUHRE, uses a large dynamic heap data structure that is accessed frequently. Devising a GPU algorithm that caches part of this data structure in shared memory so as to minimize global memory access is a challenging task. The algorithm presented here addresses this problem. Furthermore, we propose a technique to scale this algorithm to multiple GPU devices. The algorithm was implemented on a cluster of Intel® Xeon® X5650 CPU compute nodes with 4 Tesla M2090 GPU devices per node. We observed a speedup of up to 240 on a single GPU device, compared to a speedup of 70 when memory optimization was not used. On a cluster of 6 nodes (24 GPU devices) we were able to obtain a speedup of up to 3250. All speedups here are with reference to the sequential implementation running on the compute node.
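The role of the heap in a globally adaptive scheme can be sketched in a few lines. The example below is a simplified 1-D illustration, not CUHRE itself: it assumes Simpson's rule with a coarse-versus-fine error estimate in place of multidimensional cubature rules, and the function names are hypothetical. The region with the largest error estimate is always popped from the heap and bisected, which is why the heap is accessed so frequently.

```python
import heapq
import math


def simpson(f, a, b):
    """Simpson's rule on [a, b]."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f(0.5 * (a + b)) + f(b))


def estimate(f, a, b):
    """Return (integral_estimate, error_estimate) for the region [a, b]."""
    m = 0.5 * (a + b)
    coarse = simpson(f, a, b)
    fine = simpson(f, a, m) + simpson(f, m, b)
    return fine, abs(fine - coarse)


def adaptive_integrate(f, a, b, tol=1e-8, max_regions=10000):
    """Globally adaptive bisection driven by a heap of regions (1-D sketch)."""
    val, err = estimate(f, a, b)
    # heapq is a min-heap, so negated errors make the worst region pop first.
    heap = [(-err, a, b, val)]
    total_val, total_err = val, err
    while total_err > tol and len(heap) < max_regions:
        neg_err, lo, hi, v = heapq.heappop(heap)
        # Remove the popped region's contribution before adding its halves.
        total_val -= v
        total_err += neg_err
        mid = 0.5 * (lo + hi)
        for sub_lo, sub_hi in ((lo, mid), (mid, hi)):
            sv, se = estimate(f, sub_lo, sub_hi)
            heapq.heappush(heap, (-se, sub_lo, sub_hi, sv))
            total_val += sv
            total_err += se
    return total_val, total_err, len(heap)
```

For instance, `adaptive_integrate(math.sin, 0.0, math.pi)` returns an estimate close to 2 together with the accumulated error estimate and the number of regions left in the heap. Every iteration touches the heap, which is the access pattern the abstract identifies as the obstacle to caching in GPU shared memory.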