2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2017
DOI: 10.1109/ipdps.2017.9

26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight

Citations: cited by 23 publications (18 citation statements)
References: 33 publications
“…For this configuration, we first use our model to predict the performance for all valid parameter sets. Specifically, we use b_T ∈ [1,16] for 2D, and b_T ∈ [1,8] for 3D stencils, respectively. b_{S_i} for 2D stencils is chosen from the set of {128, 256, 512}, and for 3D, is chosen from {16×16, 32×16, 32×32, 64×16}.…”
Section: Parameter Tuning (mentioning)
confidence: 99%
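
The search space quoted in this statement is small enough to enumerate exhaustively. The following is a minimal sketch of that enumeration, not the citing authors' code. It assumes the temporal block size takes integer values over the closed intervals given, and `predict` is a hypothetical stand-in for the performance model, which the excerpt does not describe.

```python
from itertools import product

# Ranges taken from the citation statement; integer, inclusive bounds are assumed.
TEMPORAL_2D = range(1, 17)   # temporal block size in [1, 16] for 2D stencils
TEMPORAL_3D = range(1, 9)    # temporal block size in [1, 8] for 3D stencils
SPATIAL_2D = [128, 256, 512]
SPATIAL_3D = [(16, 16), (32, 16), (32, 32), (64, 16)]


def candidate_sets(dims):
    """Return every (temporal, spatial) block-size pair in the quoted space."""
    if dims == 2:
        return list(product(TEMPORAL_2D, SPATIAL_2D))
    if dims == 3:
        return list(product(TEMPORAL_3D, SPATIAL_3D))
    raise ValueError("only 2D and 3D stencils are covered")


def best_candidate(dims, predict):
    """Pick the pair with the highest predicted performance.

    `predict` is a hypothetical callable (temporal, spatial) -> score.
    """
    return max(candidate_sets(dims), key=lambda c: predict(*c))
```

Under these assumptions the model would be evaluated on 48 candidates for 2D stencils and 32 for 3D stencils, before any validity filtering.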
“…x ∈ [1,4] c (x,y) f (x,y) + f (x +i,y+j, z+k ) )/c 0 registers are used per thread, for single and double-precision data types, respectively. Hence, we use these limits to prune configurations which are expected to require more than the hardware limits of 255 registers per thread or 65,536 registers per SM.…”
Section: Parameter Tuning (mentioning)
confidence: 99%
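
The pruning rule in this statement can be sketched as a simple filter. This is an illustration under stated assumptions, not the citing authors' implementation. The two limits (255 registers per thread, 65,536 per SM) come from the excerpt; the per-thread register estimate is elided there, so it is passed in as a plain number, and `threads_per_block` and `blocks_per_sm` are hypothetical parameter names.

```python
MAX_REGS_PER_THREAD = 255      # limit quoted in the citation statement
MAX_REGS_PER_SM = 65_536       # limit quoted in the citation statement


def fits_register_limits(regs_per_thread, threads_per_block, blocks_per_sm=1):
    """Return True if a configuration stays within both register limits.

    regs_per_thread is an externally supplied estimate (hypothetical input);
    the formula that produces it is not recoverable from the excerpt.
    """
    if regs_per_thread > MAX_REGS_PER_THREAD:
        return False
    total = regs_per_thread * threads_per_block * blocks_per_sm
    return total <= MAX_REGS_PER_SM


def prune(configs):
    """Keep only (regs_per_thread, threads_per_block) pairs that fit."""
    return [c for c in configs if fits_register_limits(*c)]
```

For example, `prune([(64, 1024), (128, 1024), (300, 32)])` keeps only the first pair: the second exceeds the per-SM limit and the third exceeds the per-thread limit.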
“…With each process mapped to one CG, a box is processed by all the 64 CPEs in the CG. For data partitioning on the CPE cluster, we adopt the widely used 2.5D partition [5,40,26] for the box data and leveraged the double buffering mechanism [22,5,34]. As shown in Figure 4, the data on an ij plane are distributed onto the CPE cluster, with each CPE processing one sub-block [5].…”
Section: Parallelization Of Stencil Operations (mentioning)
confidence: 99%
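
The distribution of an ij plane over the CPE cluster might look like the sketch below. It is an illustration only: the excerpt gives the count of 64 CPEs and one sub-block per CPE, while the 8×8 logical arrangement, the even contiguous split, and the function names are assumptions made here. In a 2.5D scheme each CPE would then step through the k dimension plane by plane, which this sketch does not model.

```python
CPE_ROWS, CPE_COLS = 8, 8   # assumption: the 64 CPEs are treated as an 8x8 grid


def split(n, parts):
    """Split n cells into `parts` contiguous half-open ranges, as evenly as possible."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def cpe_subblocks(ni, nj):
    """Map each CPE id (0..63) to its (i_range, j_range) sub-block of an ij plane."""
    i_ranges = split(ni, CPE_ROWS)
    j_ranges = split(nj, CPE_COLS)
    return {
        r * CPE_COLS + c: (i_ranges[r], j_ranges[c])
        for r in range(CPE_ROWS)
        for c in range(CPE_COLS)
    }
```

With a 128×64 plane, for instance, every CPE receives a 16×8 sub-block under this split.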
“…Second, as a 16×8 sub-block spreads on 8 rows, the data are loaded in short stanza. An important approach to alleviate the memory access overhead is using collective data loading [5]. To facilitate collective data access, several threads form a "thread group".…”
Section: Bandwidth Oriented Optimization (mentioning)
confidence: 99%