Proceedings of the 2018 International Conference on Supercomputing
DOI: 10.1145/3205289.3205296
Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs

Abstract: Tensor contractions are higher-dimensional analogs of matrix multiplications, used in many computational contexts such as high-order models in quantum chemistry, deep learning, and finite element methods. In contrast to the wide availability of high-performance libraries for matrix multiplication on GPUs, the same is not true for tensor contractions. In this paper, we address the optimization of a set of symmetrized tensor contractions that form the computational bottleneck in the CCSD(T) coupled-cluster metho…
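A minimal sketch (not the paper's code) of what the abstract describes: a 4-index tensor contraction expressed with `numpy.einsum`, and the same contraction rewritten as a matrix multiplication after index permutation, which is how high-performance implementations commonly map contractions onto GEMM. Tensor names, index labels, and dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
O, V = 4, 6  # occupied / virtual dimensions (hypothetical sizes)

# t2: amplitude-like tensor t[a,b,i,j]; v: integral-like tensor v[c,k,b,j]
t2 = rng.standard_normal((V, V, O, O))
v = rng.standard_normal((V, O, V, O))

# Contract over the shared indices b and j:
# s[a,c,i,k] = sum_{b,j} t2[a,b,i,j] * v[c,k,b,j]
s = np.einsum("abij,ckbj->acik", t2, v)

# Equivalent formulation as a single matrix multiplication:
# group (a,i) as rows, (b,j) as the contracted dimension, (c,k) as columns.
lhs = t2.transpose(0, 2, 1, 3).reshape(V * O, V * O)        # rows (a,i), cols (b,j)
rhs = v.transpose(2, 3, 0, 1).reshape(V * O, V * O)         # rows (b,j), cols (c,k)
s2 = (lhs @ rhs).reshape(V, O, V, O).transpose(0, 2, 1, 3)  # back to (a,c,i,k)

assert np.allclose(s, s2)
```

The transpose-and-reshape step is exactly the data-layout cost that GPU-oriented contraction kernels try to hide or eliminate.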

Cited by 12 publications (6 citation statements)
References 33 publications (38 reference statements)
“…al. [24] improves tensor contractions for coupled-cluster methods in quantum chemistry by fusing multiple contractions. However, their approach performs transposes in shared memory, and these tensor contractions are different from the contractions in Kron-Matmul.…”
Section: Related Work
confidence: 99%
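The excerpt above mentions fusing multiple contractions. A hedged sketch of the idea (shapes and names are illustrative, not taken from the cited work): a chain of two contractions can be evaluated in one fused call, avoiding materialization of the intermediate tensor in memory.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8, 8))
B = rng.standard_normal((8, 8))
C = rng.standard_normal((8, 8))

# Unfused: the intermediate T is written out between the two contractions.
T = np.einsum("abc,cd->abd", A, B)          # contract index c
out_unfused = np.einsum("abd,de->abe", T, C)  # contract index d

# Fused: one call, one optimized contraction path, no explicit intermediate.
out_fused = np.einsum("abc,cd,de->abe", A, B, C, optimize=True)

assert np.allclose(out_unfused, out_fused)
```

On GPUs the payoff is larger than this sketch suggests, since the intermediate would otherwise round-trip through global memory.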
“…High throughput is achieved by operating many warps simultaneously. For example, the current V100 GPUs group 32 cores into a warp, and there are 160 warps to a GPU [54,55]. Various existing computational chemistry codes have been adapted [53,56-67] or designed from the outset (e.g., TeraChem) [67,68] to use GPUs.…”
Section: Hardware and Software Evolution Challenges
confidence: 99%
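A quick sanity check of the figures quoted above: 32 cores per warp-sized group times 160 such groups matches the V100's 5120 FP32 CUDA cores (80 SMs with 64 FP32 cores each).

```python
cores_per_warp = 32   # cores grouped per warp, per the quoted excerpt
warps_per_gpu = 160   # warp-sized groups per V100, per the quoted excerpt
total_cores = cores_per_warp * warps_per_gpu  # 5120

# Cross-check against the V100's published SM layout.
sms, fp32_cores_per_sm = 80, 64
assert total_cores == sms * fp32_cores_per_sm == 5120
```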
“…Sparse tensor contraction. Dense tensor contraction has been studied for decades on diverse hardware platforms [5,19,21,27,28,32,34,42,50,65,72,73] in scientific computing, including chemistry, physics, and mechanics. The state-of-the-art studies focus on block-sparse tensor contractions with dense blocks in tensors.…”
Section: Related Work
confidence: 99%