Abstract: To improve the efficiency of Gaussian integral evaluation on modern accelerated architectures, FLOP-efficient Obara-Saika-based recursive evaluation schemes are optimized for the memory footprint. For the 3-center 2-particle integrals that are key for the evaluation of Coulomb and other 2-particle interactions in the density-fitting approximation, the use of multiquantal recurrences (in which multiple quanta are created or transferred at once) is shown to produce significant memory savings. Other innovations …
“…To optimize the bandwidth, it is necessary to maximize the occupancy, which means minimizing the fast memory footprint. Our approach is to evaluate the 1-index integrals using eq for monotonically decreasing auxiliary indices m, reusing the memory occupied by [r̃]^(m+1) to store [r̃]^(m); this is akin to the in-place evaluation techniques we explored in ref. The prefactors in eq and the metadata (maps from index triplets r̃ to their ordinals) are independent of m.…”
Section: Methods
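The decreasing-m, in-place reuse described in the excerpt above can be illustrated with the well-known downward recurrence for the Boys function, F_m(x) = (2x F_{m+1}(x) + e^{-x})/(2m+1), which underlies Gaussian integral recursion schemes of this kind. This is a generic sketch, not the authors' multiquantal implementation: a single scalar is reused for every auxiliary index m, mimicking the reuse of the [r̃]^(m+1) storage for [r̃]^(m).

```python
import math

def boys_downward(x, m_max, extra=20):
    """Sketch of in-place downward-m evaluation (illustrative only).
    Seeds the recursion with F = 0 at a safely high index m_max + extra,
    then recurs downward; errors in the seed decay rapidly with each step.
    A single scalar f is overwritten at every m, so only one value is
    ever resident -- the analogue of reusing the (m+1)-buffer for m.
    """
    f = 0.0
    ex = math.exp(-x)
    # Burn-in: recur from the crude seed down to m_max.
    for m in range(m_max + extra, m_max - 1, -1):
        f = (2.0 * x * f + ex) / (2 * m + 1)
    vals = [0.0] * (m_max + 1)
    vals[m_max] = f
    # Record F_m for m = m_max-1 ... 0, still overwriting f in place.
    for m in range(m_max - 1, -1, -1):
        f = (2.0 * x * f + ex) / (2 * m + 1)
        vals[m] = f
    return vals
```

Because the recursion is evaluated only downward, the per-m prefactors and index metadata can be precomputed once, consistent with the excerpt's remark that they are independent of m.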
“…Each thread computes a 2-way round-robin range of 2-index integrals, one at a time, to approximately balance the load between threads. By minimizing the memory footprint of the [r̃]^(m) integrals using the in-place evaluation technique of ref, it is possible to evaluate 1-index integrals even for the [ii|ii] integrals using only 23 kB of shared memory. This allows us to assign 4 thread blocks to each SM, even on the V100 GPU with its relatively modest amount of shared memory per SM, and makes the performance of the 2-index integral evaluation less dependent on hardware details, ensuring efficient execution on current and future generations of accelerators.…”
Section: Methods
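The round-robin work assignment in the excerpt above can be sketched as follows. This is a hypothetical illustration (the function and parameter names are not from the paper): handing out small contiguous chunks of integral tasks cyclically means each thread receives a mix of early (often expensive) and late (often cheap) tasks, roughly balancing the load without any dynamic scheduling.

```python
def round_robin_ranges(n_tasks, n_threads, chunk=2):
    """Illustrative round-robin partitioner (not the authors' code).
    Tasks are dealt out in fixed-size chunks, cycling over threads,
    so each thread's share is spread across the whole task range.
    """
    assignment = [[] for _ in range(n_threads)]
    for start in range(0, n_tasks, chunk):
        tid = (start // chunk) % n_threads
        assignment[tid].extend(range(start, min(start + chunk, n_tasks)))
    return assignment
```

With chunk = 2 this mirrors the "2-way round-robin" idea: on a GPU the same index arithmetic would run inside the kernel, with each thread deriving its own task list from its thread ID rather than storing explicit lists.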
“…To make the performance analysis easier and more meaningful, we focus here on microbenchmarking the integral kernels, i.e., we analyze their performance for specific integral classes rather than computing, e.g., the entire Fock operator matrix and/or the entire set of integrals for a given problem. While microbenchmarking is less common, it provides a more detailed model of performance by removing extraneous factors (e.g., screening) that can greatly influence the performance of integration benchmarks. We strongly encourage others to follow suit.…”
Section: Performance
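A minimal microbenchmark harness in the spirit of the excerpt above might look like the following. This is a generic sketch (names and defaults are illustrative, and GPU kernels would additionally need device synchronization before each timestamp): one kernel is timed on one fixed integral class, with warm-up runs excluded, rather than timing an entire Fock build.

```python
import time

def microbenchmark(kernel, args, n_warmup=3, n_repeat=10):
    """Illustrative microbenchmark loop (not the authors' harness).
    Runs the kernel a few times untimed to warm caches/JIT, then
    reports the mean wall time per call over n_repeat timed runs.
    """
    for _ in range(n_warmup):
        kernel(*args)
    t0 = time.perf_counter()
    for _ in range(n_repeat):
        kernel(*args)
    return (time.perf_counter() - t0) / n_repeat
```

Dividing the known FLOP count of the integral class by the measured time then gives the fraction-of-peak figures quoted in the excerpts, free of screening and driver-level effects.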
“…An even more serious issue is the high memory footprint of such kernels, which exceeds the size of the lowest levels of the memory hierarchy (registers and scratchpad memory) even for relatively low angular momenta, thereby reducing the performance. Although detailed performance can be difficult to extract from the numerous publications dedicated to Gaussian AO integral evaluation on GPUs, the performance of the Head-Gordon–Pople refinement of the Obara-Saika recurrence scheme implemented by Barca et al. is in our experience representative: whereas the performance for 4-center integrals of low total angular momenta (up to (pp|pp), with s, p, d, f, g, h, i, k... denoting Gaussian AOs with angular momenta l = 0, 1, 2, 3, 4, 5, 6, 7..., respectively) was found to reach a substantial (20–50%) fraction of the peak FP64 FLOP rate, the performance for higher angular momenta dropped rapidly, to 2% of the peak rate for the (dd|dd) integrals. Another, albeit less direct, datapoint comes from a study by Johnson et al., who observed a significant loss of efficiency of the GPU code for Coulomb matrix evaluation (using the McMurchie-Davidson recurrence-based formalism) vs. the CPU counterpart as the basis set is enlarged to include higher angular momenta.…”
Section: Introduction
“…Recently, we reconsidered the design of Gaussian AO integral algorithms in order to optimize their memory footprint. For the specific case of 3-center Gaussian AO integrals, we argued that even for high angular momenta the Obara-Saika recurrence-based schemes would outperform the Rys quadrature, commonly thought to lead to optimally compact memory footprints; however, even with several algorithmic and programming innovations, the performance was reasonable for integrals up to (ff|f) but dropped for higher angular momenta to only a few percent of the hardware peak.…”
With the growing reliance of modern supercomputers on accelerator-based architectures such as graphics processing units (GPUs), the development and optimization of electronic structure methods to exploit these massively parallel resources has become a recent priority. While significant strides have been made in the development of GPU-accelerated, distributed-memory algorithms for many modern electronic structure methods, the primary focus of GPU development for Gaussian-basis atomic orbital methods has been on shared-memory systems, with only a handful of examples pursuing massive parallelism. In the present work, we present a set of distributed-memory algorithms for the evaluation of the Coulomb and exact-exchange matrices for hybrid Kohn–Sham DFT with Gaussian basis sets via direct density-fitted (DF-J-Engine) and seminumerical (sn-K) methods, respectively. The absolute performance and strong scalability of the developed methods are demonstrated on systems ranging from a few hundred to over one thousand atoms using up to 128 NVIDIA A100 GPUs on the Perlmutter supercomputer.
The traditional foundation of science lies on the cornerstones of theory and experiment. Theory is used to explain experiment, which in turn guides the development of theory. Since the advent of computers and the development of computational algorithms, computation has risen as the third cornerstone of science, joining theory and experiment on an equal footing. Computation has become an essential part of modern science, complementing experiment by enabling accurate comparison of complicated theories to sophisticated experiments, as well as guiding, by triage, both the design and targets of experiments and the development of novel theories and computational methods. Like experiment, computation relies on continued investment in infrastructure: it requires both hardware (the physical computer on which the calculation is run) and software (the source code of the programs that perform the desired simulations). In this Perspective, I discuss present-day challenges on the software side in computational chemistry, which arise from the fast-paced development of algorithms, programming models, and hardware. I argue that many of these challenges could be solved with reusable open source libraries, which are a public good, enhance the reproducibility of science, and accelerate the development and availability of state-of-the-art methods and improved software.