Proceedings of the 36th ACM International Conference on Supercomputing 2022
DOI: 10.1145/3524059.3532392

Toward accelerated stencil computation by adapting tensor core unit on GPU

Abstract: The Tensor Core Unit (TCU) has been increasingly adopted on modern high-performance processors, specialized in boosting the performance of general matrix multiplication (GEMM). Due to its highly optimized hardware design, the TCU can significantly accelerate GEMM-based operations widely used in scientific as well as deep learning applications. However, little work has exploited the TCU to accelerate non-GEMM operations such as stencil computation, which is also important in the field of high performance computing. To…
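The abstract's core idea, recasting a stencil so it can run on GEMM hardware, can be sketched in a few lines. This is a hedged illustration of the general im2col-style trick, not the paper's actual scheme; the function name, tile shapes, and the 3-point stencil are all illustrative assumptions.

```python
import numpy as np

def stencil_as_gemm(u, weights):
    """Apply a 1D 3-point stencil to `u` by gathering sliding windows
    into a matrix and issuing a single matrix-vector product, so the
    whole sweep maps onto GEMM-style hardware. Illustrative only."""
    n = len(u)
    # Each row holds one 3-point window of the input (im2col layout).
    windows = np.stack([u[i:i + 3] for i in range(n - 2)])  # shape (n-2, 3)
    return windows @ weights  # one GEMM/GEMV call performs every update

# Check against a direct per-point evaluation of the same stencil.
u = np.arange(8, dtype=np.float64)
w = np.array([1.0, -2.0, 1.0])  # second-difference stencil
direct = np.array([u[i] - 2 * u[i + 1] + u[i + 2] for i in range(6)])
assert np.allclose(stencil_as_gemm(u, w), direct)
```

The trade-off this sketch hides is the redundant memory traffic of materializing the window matrix, which is one of the problems a TCU-adapted stencil scheme has to address.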

Cited by 12 publications (2 citation statements)
References 40 publications
“…Nearly a decade later, with the surge of Artificial Intelligence (AI), the community realized that the performance of GPUs was not high enough to properly handle the new deep learning models being developed. For this reason, around 2017, NVIDIA introduced tensor cores [3][4][5][6][7][8][9][10][11][12] inside the chip to further accelerate AI applications. GPU tensor cores are Application Specific Integrated Circuits (ASICs), or simply special-purpose cores, that perform fast matrix multiply-accumulate (MMA) operations.…”
Section: From General Purpose To Specific Purposementioning
confidence: 99%
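The MMA primitive this citation statement refers to computes D = A × B + C on small tiles in hardware. The following is a minimal emulation of its semantics, assuming the common NVIDIA convention of half-precision inputs accumulated in single precision on 16×16 fragments; the tile size and dtypes are illustrative, not a hardware specification.

```python
import numpy as np

def mma(a, b, c):
    """Emulate one tensor-core MMA step on a tile: multiply the
    low-precision inputs, accumulate into the higher-precision C.
    Illustrative sketch, not the actual hardware data path."""
    return a.astype(np.float32) @ b.astype(np.float32) + c

# One 16x16 fragment, mirroring the common fp16-in / fp32-accumulate mode.
rng = np.random.default_rng(0)
a = rng.standard_normal((16, 16)).astype(np.float16)
b = rng.standard_normal((16, 16)).astype(np.float16)
c = np.zeros((16, 16), dtype=np.float32)
d = mma(a, b, c)
assert d.shape == (16, 16) and d.dtype == np.float32
```

Accumulating in a wider type than the inputs is what lets the unit stay fast without losing too much precision over long reduction chains, which is why non-GEMM workloads such as stencils try to map onto this primitive.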
“…Successful research has been done in recent years. In the case of tensor cores, new ways have been proposed to further accelerate arithmetic reductions [16, 13, 5-12, 17-21], prefix sums [4-12, 17-21, 22-29], the Fast Fourier Transform [22], [10], [23], [5], stencil computations for PDE simulations [11], and even fractals [14,…]. In general, all of these works achieve significantly higher performance compared to traditional GPU implementations.…”
Section: New Research Opportunitiesmentioning
confidence: 99%