26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight

Ao, Yulong; Yang, Chao; Wang, Xinliang; Xue, Wei; Fu, Haohuan; Liu, Fangfang; Gan, Lin; Xu, Ping; Ma, Wei

doi:10.1109/ipdps.2017.9

Cited by 23 publications

(18 citation statements)

References 33 publications


“…For this configuration, we first use our model to predict the performance for all valid parameter sets. Specifically, we use b_T ∈ [1,16] for 2D, and b_T ∈ [1,8] for 3D stencils, respectively. b_{S_i} for 2D stencils is chosen from the set of {128, 256, 512}, and for 3D, is chosen from {16×16, 32×16, 32×32, 64×16}.…”

Section: Parameter Tuning (mentioning)

confidence: 99%
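
To make the size of the search space in the statement above concrete, here is a minimal sketch that enumerates the quoted ranges. The names `candidate_sets`, `BT_2D`, `BS_3D`, and so on are hypothetical and are not taken from the cited framework; only the numeric ranges come from the quotation.

```python
from itertools import product

# Ranges quoted in the citation statement; all names here are illustrative.
BT_2D = range(1, 17)                      # b_T in [1, 16] for 2D stencils
BT_3D = range(1, 9)                       # b_T in [1, 8] for 3D stencils
BS_2D = [(128,), (256,), (512,)]          # 2D spatial block sizes
BS_3D = [(16, 16), (32, 16), (32, 32), (64, 16)]  # 3D spatial block sizes


def candidate_sets(dim):
    """Return every (b_T, b_S) pair for a 2D or 3D stencil."""
    if dim == 2:
        return list(product(BT_2D, BS_2D))
    if dim == 3:
        return list(product(BT_3D, BS_3D))
    raise ValueError("dim must be 2 or 3")


if __name__ == "__main__":
    print(len(candidate_sets(2)), len(candidate_sets(3)))
```

Under these ranges the model would rank 48 candidates for a 2D stencil and 32 for a 3D stencil before any pruning.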

“…x ∈ [1,4] c (x,y) f (x,y) + f (x +i,y+j, z+k ) )/c 0 registers are used per thread, for single and double-precision data types, respectively. Hence, we use these limits to prune configurations which are expected to require more than the hardware limits of 255 registers per thread or 65,536 registers per SM.…”

Section: Parameter Tuning (mentioning)

confidence: 99%
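
The pruning rule in the statement above can be sketched as a simple filter. This is an illustration under stated assumptions, not the cited framework's code: the register estimate is passed in by the caller (the formula in the quotation is not recoverable here), the config keys `b_T` and `n_thr` are hypothetical, and the per-SM check assumes a single resident thread block.

```python
MAX_REGS_PER_THREAD = 255    # per-thread hardware limit quoted above
MAX_REGS_PER_SM = 65_536     # per-SM hardware limit quoted above


def fits_register_limits(regs_per_thread, threads_per_block):
    """True if one thread block fits within both quoted register limits."""
    if regs_per_thread > MAX_REGS_PER_THREAD:
        return False
    # Assumption: at least one whole block must be resident on an SM.
    return regs_per_thread * threads_per_block <= MAX_REGS_PER_SM


def prune(configs, estimate_regs):
    """Keep configs whose estimated register use fits the limits.

    `estimate_regs` is a caller-supplied (hypothetical) model that maps a
    config dict to an expected registers-per-thread count.
    """
    return [c for c in configs
            if fits_register_limits(estimate_regs(c), c["n_thr"])]
```

For example, with a made-up estimate of 20 registers per temporal-blocking degree, a config with `b_T = 8` and 1,024 threads would be rejected because 160 × 1,024 exceeds 65,536.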

“…We conduct a large-scale parameter search to find the optimal parameters for each combination of stencil pattern and GPU. Here, around 10,000 and 5,000 parameter configurations are explored for each 2D (b_T = [2,20], b_S = [1, 32] × [32,2048], n_thr = [1, 32] × [32,1024]) and 3D stencil ([2,12], [1,4] × [1,32] × [32,256], [1,4] × [1,32] × [32,256]), respectively. We set 8,192² and 512³ as 2D/3D grid size and 120 as iteration count for parameter search.…”

Section: Parameter Tuning (mentioning)

confidence: 99%


AN5D: automated stencil framework for high-degree temporal blocking on GPUs

Kazuaki, Zohouri, Wahib, et al. 2020

Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization


Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code, and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.

CCS Concepts: Software and its engineering → Source code generation.
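
As a rough illustration of what temporal blocking means in the abstract above, the sketch below applies overlapped (halo-based) temporal blocking to a 1D three-point averaging stencil and can be checked against a plain sweep. It is a generic textbook-style scheme in pure Python, not AN5D's optimization strategy; the function names and the parameters `b_s` (spatial block size) and `b_t` (temporal blocking degree) are chosen here for illustration.

```python
def step(u):
    """One Jacobi sweep of a 3-point average; the two end points stay fixed."""
    v = list(u)
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v


def naive(u, steps):
    """Reference: sweep the whole array once per time step."""
    for _ in range(steps):
        u = step(u)
    return u


def temporally_blocked(u, steps, b_s, b_t):
    """Advance up to b_t time steps per pass over each spatial block.

    Each block is loaded with a halo of k cells per side, updated k times
    locally, and only its interior [s, e) is written back. Halo cells go
    stale by one cell per step, so a halo of width k keeps [s, e) exact.
    """
    n = len(u)
    done = 0
    while done < steps:
        k = min(b_t, steps - done)          # steps fused in this pass
        out = [0.0] * n
        for s in range(0, n, b_s):
            e = min(s + b_s, n)
            lo, hi = max(0, s - k), min(n, e + k)
            tile = u[lo:hi]                 # one load from "slow" memory
            for _ in range(k):
                tile = step(tile)           # k updates in "fast" memory
            out[s:e] = tile[s - lo:e - lo]  # write back the valid interior
        u = out
        done += k
    return u
```

The trade-off this exposes is the one the abstract alludes to: a larger `b_t` means fewer passes over main memory but wider halos, so more redundant computation and more on-chip storage per block.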




“…With each process mapped to one CG, a box is processed by all the 64 CPEs in the CG. For data partitioning on the CPE cluster, we adopt the widely used 2.5D partition [5,40,26] for the box data and leveraged the double buffering mechanism [22,5,34]. As shown in Figure 4, the data on an ij plane are distributed onto the CPE cluster, with each CPE processing one sub-block [5].…”

Section: Parallelization Of Stencil Operations (mentioning)

confidence: 99%
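
The 2.5D partition described in the statement above can be sketched as follows: the ij plane is split into one sub-block per CPE, and the sweep along the third dimension alternates between two buffers. This is a minimal sketch under assumptions, not the cited implementation. The 8×8 arrangement of the 64 CPEs, the function names, and the 128×64 plane size in the usage note are illustrative choices.

```python
def split(n, parts):
    """Split range(n) into `parts` contiguous half-open intervals."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def partition_plane(ni, nj, rows=8, cols=8):
    """Map each CPE (row, col) to its (i0, i1, j0, j1) sub-block."""
    i_bounds, j_bounds = split(ni, rows), split(nj, cols)
    return {(r, c): (*i_bounds[r], *j_bounds[c])
            for r in range(rows) for c in range(cols)}


def double_buffered_sweep(nk):
    """Yield (k, compute_buf, load_buf) for a sweep over nk planes.

    While plane k is computed from one buffer, the next plane would be
    loaded into the other buffer; the two roles swap every step.
    """
    for k in range(nk):
        yield k, k % 2, (k + 1) % 2
```

With these assumptions, `partition_plane(128, 64)` gives each of the 64 CPEs a 16×8 sub-block, for instance `(0, 16, 0, 8)` for CPE `(0, 0)`.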

“…Second, as a 16×8 sub-block spreads on 8 rows, the data are loaded in short stanza. An important approach to alleviate the memory access overhead is using collective data loading [5]. To facilitate collective data access, several threads form a "thread group".…”

Section: Bandwidth Oriented Optimization (mentioning)

confidence: 99%

Solving a trillion unknowns per second with HPGMG on Sunway TaihuLight

Yang et al. 2019

Cluster Comput

Self Cite


Benchmarks for supercomputers are important tools, not only for evaluating and ranking modern supercomputers, but also for providing hints for future architecture design. As a new benchmark, HPGMG (High Performance Geometric Multigrid) solves a linear equation set with a full geometric multi-grid algorithm. It involves computation on different scales, data movement with various volumes, global communication and neighbor communication with both large and small messages, etc., and is more correlated to real world applications than traditional benchmarks such as LINPACK. Therefore, it is desirable to examine how well HPGMG can perform on leadership supercomputers such as Sunway TaihuLight. Sunway TaihuLight, the No. 1 supercomputer in the Top 500 list from June 2016 to June 2018, which uses a specially designed many-core architecture SW26010, is of great interest to the community of high performance computing. With careful analysis and code design, we came up with an efficient implementation of HPGMG on SW26010 processors. We not only employed traditional optimization techniques such as 2.5D
