A CUDA Implementation of the High Performance Conjugate Gradient Benchmark

Phillips, Everett; Fatica, Massimiliano

doi:10.1007/978-3-319-17248-4_4

Cited by 17 publications

(21 citation statements)

References 7 publications

“…For the full-system tests, the overheads of the halo exchange and the global collective with respect to the overall HPCG runtime are only around 7.3% and 5.0%, respectively. For comparison purposes, Table 1 summarizes the HPCG results on several other systems, collected from both published results [20,30,31] and the official HPCG list of June 2017 [9]. It can be seen that although the HPCG-to-HPL ratio of the Sunway platform is relatively low because of the highly limited data-moving capability, the Flop/Byte efficiency [37], which measures the ratio of the HPCG performance to the total memory bandwidth, is comparable to other systems.…”

Section: Full-system Results (mentioning)

confidence: 99%
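
The Flop/Byte efficiency described in the statement above is a plain ratio, which a minimal Python sketch can make concrete. The function name, parameter names, and the sample figures below are hypothetical and chosen only for illustration; they are not measurements from any of the cited systems.

```python
def flop_per_byte(hpcg_flops_per_s: float, mem_bandwidth_bytes_per_s: float) -> float:
    """Ratio of sustained HPCG performance to total memory bandwidth.

    Both arguments are rates per second, so the result is in flop per byte.
    """
    if mem_bandwidth_bytes_per_s <= 0:
        raise ValueError("memory bandwidth must be positive")
    return hpcg_flops_per_s / mem_bandwidth_bytes_per_s


# Hypothetical system: 1 Tflop/s of HPCG performance, 8 TB/s of memory bandwidth.
example_ratio = flop_per_byte(1.0e12, 8.0e12)
```

Because HPCG is memory bound, comparing this ratio across machines indicates how well each implementation uses the bandwidth it has, independently of the machine's peak floating-point rate.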

“…HPCG has drawn increasing attention from both academics and industry since its announcement in 2013. For example, a multicolor reordering technique was employed to improve the performance of HPCG on CPU-GPU heterogeneous clusters [31]. The optimization of HPCG on the K supercomputer was done in Kumahata et al. [20], where a block multicoloring method was employed for the parallelization of SymGS.…”

Section: Related Work (mentioning)

confidence: 99%

“…These operations are usually memory bound with low arithmetic intensities, irregular data access patterns, and neighboring and collective communications, and thus are challenging to optimize on modern supercomputing systems. Since its release, HPCG has been attracting rapidly increased research interests in the HPC community [20,23,24,29,30,31].…”

Section: Introduction (mentioning)

confidence: 99%

Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight Supercomputer

Yang, Liu, et al. 2018

ACM Trans. Archit. Code Optim.

In this article, we present some key techniques for optimizing HPCG on Sunway TaihuLight and demonstrate how to achieve high performance in memory-bound applications by exploiting specific characteristics of the hardware architecture. In particular, we utilize a block multicoloring approach for parallelization and propose methods such as requirement-based data mapping and customized gather collective to enhance the effective memory bandwidth. Experiments indicate that the optimized HPCG code can sustain 77% of the theoretical memory bandwidth and scale to the full system of more than 10 million cores, with an aggregated performance of 480.8 Tflop/s and a weak scaling efficiency of 87.3%.

“…HPL is a benchmark program that determines the solution to Ax = b, which denotes a large-scale dense matrix problem of a linear equation. The performance of HPL is determined by the 64-bit floating-point operation used in multiplication of the dense matrix, which is a major calculation in the methodology of the benchmark program [5,22]. The FLOPS value, which is obtained from HPL, is used as a measure of supercomputing performance in the TOP500 Project, which presents a list of the top 500 fastest supercomputers in the world since 1993.…”

Section: Prior Studies On Supercomputer Performance Measurement (mentioning)

confidence: 99%

“…However, the majority of current applications compute differential equations that require high memory bandwidth and irregular data access. As a consequence, there is a low correlation between the performances of HPL and the application [22,23,24].…”

Section: Prior Studies On Supercomputer Performance Measurement (mentioning)

confidence: 99%

Composite Measures of Supercomputer Technology

Kim, On, Koh, et al. 2019

KSII TIIS

We have developed composite measures of supercomputer technology, reflecting various factors of supercomputers using Martino's scoring model. CPUs, accelerators, memory, interconnection networks, and power consumption are chosen as factors of the model. The weight values of the factors are derived based on a survey of 129 domestic and international experts. The measured values are then standardized to integrate measurement units of the factors in the model. This model has been applied to 50 supercomputers, and rank correlation analysis was performed using representative measures. As a consequence, the ranking drastically changes except for the 1st and 2nd supercomputers on the TOP500. In addition, the characteristics of memory and interconnection networks influence the ranking, and the results demonstrate that the proposed model has low correlations with HPL and HPCG but a high correlation with Green500. This indicates that power consumption is an important factor that has a significant effect on the measures of supercomputer technology. In addition, it is determined that the differences between the HPL ranking and the proposed model ranking are influenced by power consumption, CPU theoretical peak performance, and main memory bandwidth in order of significance. In conclusion, the composite measures proposed in this study are more suitable for comprehensively describing supercomputer technology than existing performance measures. The findings of this study are expected to support decision making related to management and policy in the procurement and operation of supercomputers.

Efficient and high‐quality sparse graph coloring on GPUs

Chen, Fang, et al. 2016

Concurrency and Computation

Summary: Graph coloring has been broadly used to discover concurrency in parallel computing. To speed up graph coloring for large‐scale datasets, parallel algorithms have been proposed to leverage modern GPUs. Existing GPU implementations either have limited performance or yield unsatisfactory coloring quality (too many colors assigned). We present a work‐efficient parallel graph coloring implementation on GPUs with good coloring quality. Our approach uses the speculative greedy scheme, which inherently yields better quality than the method of finding maximal independent set. To achieve high performance on GPUs, we refine the algorithm to leverage efficient operators and alleviate conflicts. We also incorporate common optimization techniques to further improve performance. Our method is evaluated with both synthetic and real‐world sparse graphs on the NVIDIA GPU. Experimental results show that our proposed implementation achieves an average 4.1× (up to 8.9×) speedup over the serial implementation. It also outperforms the existing GPU implementation from the NVIDIA CUSPARSE library (2.2× average speedup), while yielding much better coloring quality than CUSPARSE.
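
The speculative greedy scheme named in this summary can be illustrated with a minimal sequential sketch in Python. This is not the authors' GPU implementation: the function name, the adjacency-list input, and the rule that the lower-numbered vertex keeps its color on a conflict are all assumptions made for illustration. Each loop iteration stands in for one parallel round, in which every uncolored vertex reads the same snapshot of colors.

```python
def speculative_greedy_coloring(adj):
    """Color an undirected graph given as adjacency lists (hypothetical helper).

    Each round: (1) every uncolored vertex speculatively takes the smallest
    color not used by its already-colored neighbors, reading a shared snapshot;
    (2) conflicts between same-round neighbors are detected and the
    higher-numbered vertex is sent back to the worklist.
    """
    n = len(adj)
    colors = [-1] * n              # -1 marks an uncolored vertex
    worklist = list(range(n))
    while worklist:
        snapshot = colors[:]       # emulates all threads reading the same state
        for v in worklist:
            used = {snapshot[u] for u in adj[v] if snapshot[u] != -1}
            c = 0
            while c in used:
                c += 1
            colors[v] = c
        # Conflict detection: a vertex loses if a lower-numbered neighbor
        # ended the round with the same color.
        conflicts = [
            v for v in worklist
            if any(u < v and colors[u] == colors[v] for u in adj[v])
        ]
        for v in conflicts:
            colors[v] = -1
        worklist = conflicts       # the lowest-numbered vertex always keeps its color
    return colors


# A triangle needs three colors; a 4-cycle needs two.
triangle = [[1, 2], [0, 2], [0, 1]]
triangle_colors = speculative_greedy_coloring(triangle)
```

Vertices that share a color have no edge between them, so a solver such as the SymGS smoother in HPCG can process one color class at a time in parallel. Fewer colors mean fewer sequential phases, which is why coloring quality matters alongside speed.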
