Low Precision Processing for High Order Stencil Computations

Singh, Gagandeep; Diamantopoulos, Dionysios; Stuijk, Sander; Hagleitner, Christoph; Corporaal, Henk

doi:10.1007/978-3-030-27562-4_29

Cited by 9 publications

(8 citation statements)

References 14 publications


“…Accuracy impact with FP16 precision -Although traditional stencil applications use higher precision such as FP64 and FP32, there are increasing research works demonstrate the success of using lower precision such as FP16 in stencil application [9,12,35,36].…”

Section: Discussion (mentioning)

confidence: 99%

Toward accelerated stencil computation by adapting tensor core unit on GPU

Liu

Yang

et al. 2022

Proceedings of the 36th ACM International Conference on Supercomputing


The Tensor Core Unit (TCU) has been increasingly adopted on modern high performance processors, specialized in boosting the performance of general matrix multiplication (GEMM). Due to its highly optimized hardware design, TCU can significantly accelerate GEMM-based operations widely used in scientific as well as deep learning applications. However, there is few work exploiting TCU to accelerate non-GEMM operations such as stencil computation that is also important in the field of high performance computing. To the best of our knowledge, there is no previous work that adapts stencil computation to TCU efficiently by considering its unique characteristics. In this paper, we propose a new method called TCstencil to adapt TCU for accelerating stencil computation. Specifically, we re-design the stencil computation as a series of reduction and summation operations in order to leverage the computing power of TCU. In addition, we propose corresponding optimizations for better exploiting TCU and memory hierarchy on GPU. We evaluate our method with different stencils and input mesh sizes on NVIDIA A100 and V100 GPUs. The experiment results demonstrate our method can achieve superior performance compared to the state-of-the-art stencil optimization frameworks.
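To make the FP16 accuracy question raised in the citation statement above concrete, the following is a minimal sketch, not taken from either paper. It assumes a simple 3-point averaging stencil and emulates FP16 storage by rounding every stored value through the half-precision format of Python's `struct` module, while the arithmetic itself is still carried out in FP64. The function names `to_fp16`, `jacobi_step`, and `max_abs_error` are hypothetical.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float (FP64) to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def jacobi_step(u, rnd=lambda v: v):
    """One sweep of a 3-point averaging stencil; the two boundary values stay fixed.

    `rnd` is applied to every stored result to emulate the storage precision.
    The sum and division are still evaluated in FP64 (an assumption of this sketch).
    """
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = rnd((u[i - 1] + u[i] + u[i + 1]) / 3.0)
    return out


def max_abs_error(n: int = 64, steps: int = 50) -> float:
    """Largest pointwise gap between the FP64 run and the FP16-storage run."""
    u64 = [math.sin(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    u16 = [to_fp16(v) for v in u64]
    for _ in range(steps):
        u64 = jacobi_step(u64)
        u16 = jacobi_step(u16, to_fp16)
    return max(abs(a - b) for a, b in zip(u64, u16))


if __name__ == "__main__":
    print(f"max |FP64 - FP16| after 50 sweeps: {max_abs_error():.3e}")
```

Because the stencil averages its neighbours, rounding errors are damped rather than amplified, which is one reason low-precision storage can be tolerable for kernels of this kind. Whether it is acceptable for a given application still has to be checked against that application's own error budget.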



“…Unlike stencils found in the literature [51,55,56,135,154,168,169], real-world compound stencils consist of a collection of stencils that perform a sequence of element-wise computations with complex interdependencies. Such compound kernels have complex memory access patterns and low arithmetic intensity because they have limited operations per loaded value.…”

Section: Related Work (mentioning)

confidence: 99%

“…Szustak et al accelerate the MPDATA advection scheme on multi-core CPU [159] and computational fluid dynamics kernels on FPGA [133]. Singh et al [154] explore the applicability of different number formats and exhaustively search for the appropriate bit-width for memory-bound stencil kernels to improve performance and energy efficiency with minimal loss in the accuracy. Bianco et al [41] optimize the COSMO weather prediction model for GPUs.…”

Section: Related Work (mentioning)

confidence: 99%

Accelerating Weather Prediction Using Near-Memory Reconfigurable Fabric

Singh

Diamantopoulos

Gómez-Luna

et al. 2022

ACM Trans. Reconfigurable Technol. Syst.

Self Cite


Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils that are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, an FPGA+HBM-based accelerator connected through OCAPI (Open Coherent Accelerator Processor Interface) to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by 5.3 × and 12.7 × when running two different compound stencil kernels. NERO reduces the energy consumption by 12 × and 35 × for the same two kernels over the POWER9 system with an energy efficiency of 1.61 GFLOPS/Watt and 21.01 GFLOPS/Watt. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.
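The citation statement above describes an exhaustive search for an appropriate bit-width. The sketch below illustrates that general idea only and is not the methodology of Singh et al. It assumes a signed fixed-point format without saturation, a 5-point 1D stencil, and a user-chosen error tolerance. All names (`quantize`, `stencil_1d`, `smallest_frac_bits`) and the example coefficients are hypothetical.

```python
import math


def quantize(x: float, frac_bits: int) -> float:
    """Round x to the nearest multiple of 2**-frac_bits (fixed point, no saturation)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def stencil_1d(u, coeffs, frac_bits=None):
    """Apply a 1D stencil to the interior points; edge points are copied through.

    With frac_bits set, inputs, coefficients, and outputs are all quantized.
    """
    q = (lambda v: v) if frac_bits is None else (lambda v: quantize(v, frac_bits))
    r = len(coeffs) // 2
    c = [q(w) for w in coeffs]
    x = [q(v) for v in u]
    out = list(x)
    for i in range(r, len(x) - r):
        out[i] = q(sum(c[k] * x[i + k - r] for k in range(len(c))))
    return out


def smallest_frac_bits(u, coeffs, tol, max_bits=32):
    """Scan bit-widths in ascending order; return the first (bits, error) within tol."""
    ref = stencil_1d(u, coeffs)  # double-precision reference
    for bits in range(1, max_bits + 1):
        approx = stencil_1d(u, coeffs, bits)
        err = max(abs(a - b) for a, b in zip(ref, approx))
        if err <= tol:
            return bits, err
    return None  # no bit-width up to max_bits met the tolerance


if __name__ == "__main__":
    # Illustrative 5-point coefficients (fourth-order second-derivative weights).
    coeffs = [-1 / 12, 4 / 3, -5 / 2, 4 / 3, -1 / 12]
    u = [math.sin(0.1 * i) for i in range(128)]
    print(smallest_frac_bits(u, coeffs, tol=1e-3))
```

The scan is exhaustive over the candidate range because the error is not guaranteed to decrease monotonically with the bit-width, so a binary search could skip a valid setting. A real exploration would also vary the integer bits and the number format itself, which this sketch leaves out.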


“…Szustak et al accelerate the MPDATA advection scheme on multi-core CPU [134] and computational fluid dynamics kernels on FPGA [116]. Singh et al [130] explore the applicability of different number formats and exhaustively search for the appropriate bit-width for memory-bound stencil kernels to improve performance and energy-efficiency with minimal loss in the accuracy. Bianco et al [29] optimize the COSMO weather prediction model for GPUs while Thaler et al [136] port COSMO to a many-core system.…”

Section: Related Work (mentioning)

confidence: 99%

Accelerating Weather Prediction using Near-Memory Reconfigurable Fabric

Singh

Diamantopoulos

Gómez-Luna

et al. 2021

Preprint

Self Cite


Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils that are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, an FPGA+HBM-based accelerator connected through IBM OCAPI (Open Coherent Accelerator Processor Interface) to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by 5.3× and 12.7× when running two different compound stencil kernels. NERO reduces the energy consumption by 12× and 35× for the same two kernels over the POWER9 system with an energy efficiency of 1.61 GFLOPS/Watt and 21.01 GFLOPS/Watt. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.

